This chapter describes the basic searchbox Engine administration procedures.
All basic administration procedures must be performed from the command-line interface by a system administrator with administrator permissions. The main searchboxd executable must be invoked with specific switches to perform the various actions. The procedure differs slightly between Windows and Unix systems.
Under Windows the searchboxd process runs as a system service, so it must be invoked through the Windows Service Manager.
On Unix systems searchboxd is a command that can be invoked from a standard terminal. The searchboxd command can be found in:
/opt/focuseek/bin/ on Linux (Red Hat, for instance)
/Applications/Searchbox/bin/ on Mac OS X
Sometimes you want to copy just the configuration between two searchbox installations. To do this you can export the configuration and then import it on the other system.
The import operation implies a reset.
This is an uninterruptible operation.
The operations required to perform an export and/or an import differ depending on the operating system searchbox runs on.
To perform an export, open Windows Service Manager and open the property sheet for Searchbox Engine service. Stop the service if it is running by pressing the Stop button. Enter /export dir in the Start parameters text field and press the Start button. The searchbox service will stop automatically after the export is finished.
dir is a directory that will contain one file
for each source, archive, collection, watch, plugin etc. configured on
the system. The directory will be created by searchboxd and must not
exist before the export.
To perform an import, open Windows Service Manager and open the property sheet for Searchbox Engine service. Stop the service if it is running by pressing the Stop button. Enter /import dir in the Start parameters text field and press the Start button. The searchbox service will stop automatically after the import is finished.
dir is a directory created by a previous export; enclose it in double quotes if the path contains spaces.
To perform an export first stop searchboxd, then run
# searchboxd --export=dir
Where dir is a directory that will contain
one file for each source, archive, collection, watch, plugin etc.
configured on the system. The directory will be created by searchboxd
and must not exist before the export.
To perform an import first stop searchboxd, then run
# searchboxd --import=dir
Where dir is a directory created by a
previous --export.
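The Unix export and import steps above can be scripted. The following is a minimal sketch, not part of the product: the function names are hypothetical, and it only builds the command lines shown above after checking the directory preconditions. It assumes searchboxd has already been stopped and is found on the PATH.

```python
import os
from typing import List


def export_command(target_dir: str, executable: str = "searchboxd") -> List[str]:
    """Build the export command; the target directory must not exist yet."""
    if os.path.exists(target_dir):
        raise FileExistsError(f"export directory already exists: {target_dir}")
    return [executable, f"--export={target_dir}"]


def import_command(source_dir: str, executable: str = "searchboxd") -> List[str]:
    """Build the import command; the source must be a directory from a previous export."""
    if not os.path.isdir(source_dir):
        raise FileNotFoundError(f"no such export directory: {source_dir}")
    return [executable, f"--import={source_dir}"]
```

The returned list can be passed to subprocess.run(cmd, check=True). Remember that both operations are uninterruptible and that an import implies a reset.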
On some rare occasions (basically after a system crash or to migrate to a newer version of searchbox) you may want to rebuild searchbox indexes based on the documents stored in searchbox.
Indexes for archives with DocumentCache = NOCACHE cannot be regenerated in this way. They will be empty after a reindex operation.
Depending on the amount of data the reindex operation can be really time consuming.
This is an uninterruptible operation.
The operations required to perform a reindex differ depending on the operating system searchbox runs on.
To perform a reindex, open Windows Service Manager and open the property sheet for Searchbox Server service. Stop the service if it is running by pressing the Stop button. Enter /reindex in the Start parameters text field and press the Start button. The searchbox service will stop automatically after the reindex is finished.
The reset operation lets you remove data and indexes from searchbox while retaining your configuration i.e. sources, archives, collections and watches, as well as plugin configurations, users and so on.
Obviously all the documents and the indexes will be lost.
This is an uninterruptible operation.
The operations required to perform a reset differ depending on the operating system searchbox runs on.
To perform a reset, open Windows Service Manager and open the property sheet for Searchbox Server service. Stop the service if it is running by pressing the Stop button. Enter /reset in the Start parameters text field and press the Start button. The searchbox service will stop automatically after the reset is finished.
searchbox is an "always on" system: it usually performs all its internal optimization tasks concurrently with its normal service operations.
To perform an optimization, open Windows Service Manager and open the property sheet for Searchbox Server service. Stop the service if it is running by pressing the Stop button. Enter /optimize in the Start parameters text field and press the Start button. The searchbox service will stop automatically after the optimization is finished.
On both Windows and Unix systems, some information about searchbox status can be obtained from a menu choice of the Control Panel.
The Current version parameter is the version of the searchbox Engine the Control Panel is currently logged on to. This number is formatted as
major.minor[.revision[.build]]
The Available update parameter warns if a new version of the Engine is available for download. This parameter is effective only if the automated check for new releases is activated in the focuseek.cfg configuration file.
The Status parameter is the current status of the searchbox Engine. It can be:
running. The searchbox Engine is currently running.
crawlstopped. Crawling activity is stopped because of a global configuration setting. For more information see the focuseek.cfg configuration file.
lowdiskspace. Crawling activity is currently stopped because the volume where the datadir is located does not have enough free space for the searchbox Engine to work properly. See the MINADDFREEDISK and MINFREEDISK parameters in the focuseek.cfg configuration file.
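The major.minor[.revision[.build]] format of the Current version parameter can be parsed as follows. This is a minimal sketch: the function name is hypothetical, and it assumes every component is a non-negative decimal integer and that missing optional components can be treated as 0 for comparison purposes.

```python
import re
from typing import Tuple

# major.minor are mandatory; revision and build are optional and nested.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+)(?:\.(\d+))?)?$")


def parse_version(text: str) -> Tuple[int, int, int, int]:
    """Return (major, minor, revision, build); missing parts become 0 (assumption)."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a major.minor[.revision[.build]] version: {text!r}")
    return tuple(int(part) if part is not None else 0 for part in match.groups())
```

Because the result is a tuple of integers, two versions can be compared directly, for example parse_version(current) < parse_version(available).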
Other information concerns the server load and the activation key currently used by the searchbox Engine.
| Key | Value |
|---|---|
| number of documents | the number of documents stored in all searchbox archives |
| size of documents (MB) | the size of documents stored in all searchbox archives |
| number of archives | the number of archives |
| free disk spaces (MB) | the free disk space available in the volume where the datadir is located |
| OS type | the operating system compatible with the current activation key |
| platform version | the searchbox Engine version compatible with the current activation key |
| expiration time | when the current activation key expires |
| maximum number of documents | the maximum number of documents allowed by the current activation key |
| maximum size of documents (MB) | the maximum size of documents allowed by the current activation key |
| maximum number of archives | the maximum number of archives allowed by the current activation key |
| maximum number of users | the maximum number of users allowed by the current activation key |
| custom plugin type | Yes, if the current activation key supports installation of custom plugins |
| custom plugins | Yes, if the current activation key supports configuration of custom plugins |
| historicization | Yes, if the historicization feature is enabled |
| API | Yes, if the SOAP API is available |
More detailed status information can be obtained on Unix systems by running the following command:
# searchboxd --status
Possible outputs are:
searchboxd not running: searchbox is not running
searchboxd(<pid>) starting: searchbox is starting; this is only a transitory state; soon it will be “running”
searchboxd(<pid>) running: searchbox is running
searchboxd(<pid>) stopping: searchbox is stopping; this is only a transitory state; soon it will be “not running”
searchboxd(<pid>) uninterruptible operation: searchbox is performing some uninterruptible and possibly time-consuming operation, such as rebuilding the whole index. While running such an operation searchboxd cannot be stopped; instead it stops on its own when the operation is finished. Uninterruptible operations must be explicitly started by running searchboxd with some special switch; see later.
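A monitoring script could interpret the status line as sketched below. This is an illustration only: the function name is hypothetical, and it assumes the command prints exactly one of the lines listed above, with the process ID in parentheses immediately after searchboxd.

```python
import re
from typing import Optional, Tuple

# Assumed shape of the status line: searchboxd(<pid>) <state>
_STATUS_RE = re.compile(
    r"^searchboxd\((?P<pid>\d+)\)\s+"
    r"(?P<state>starting|running|stopping|uninterruptible operation)\b"
)


def parse_status(line: str) -> Tuple[Optional[int], str]:
    """Return (pid, state); pid is None when searchboxd is not running."""
    line = line.strip()
    if line.startswith("searchboxd not running"):
        return None, "not running"
    match = _STATUS_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized status line: {line!r}")
    return int(match.group("pid")), match.group("state")
```

A caller would typically wait and retry while the state is starting, stopping or uninterruptible operation, since searchboxd cannot be stopped during an uninterruptible operation.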
Under Windows equivalent information can be obtained from the Services Administrative Tool.
While we have extensively tested searchbox in a variety of conditions, software is never perfect and 100% bug free. We appreciate and encourage your feedback on searchbox issues, but we'd like to also suggest a few tips that might help you get back up to speed faster. Should searchbox crash, you can try the following corrective actions, in order of preference:
Try stopping and restarting the platform. This is the simplest corrective action and can resolve temporary problems such as component crashes caused by, for example, very low memory situations. This action should not cause any data loss.
On Unix systems, if searchboxd crashes or the power is lost, a file might be left behind that prevents searchboxd from starting or stopping correctly. If you are absolutely sure that searchboxd is not running, look for the file /opt/searchbox/data/runpid and, if it is there and you have checked a second time that searchboxd is not running, remove it.
If you suspect that the index might be corrupt (for example, you get wrong query results, or searchbox crashes while performing queries while other operations complete correctly), you can try to rebuild the index; see “Forcing index to be rebuild”. Be aware that rebuilding the index from scratch might be very time consuming.
Stop the platform and try to export and then import the configuration; see above in system-specific sections to know how. Then, restart the platform. You will lose all the indexes and all the cached documents, but you will retain your configuration data.
All data generated by your searchbox installation is contained in the so-called datadir. The datadir is a folder created during the first launch of searchbox in the default location specified in the searchbox.cfg configuration file (see "Program and data location" in the next section). The datadir contains the physical state of your installed searchbox, so it is possible to have multiple datadirs and switch from one to another simply by shutting down searchbox and changing the DATADIR parameter in the focuseek.cfg configuration file.
searchbox can manage only one datadir at a time.
If the datadir comes from another searchbox installation on a different server, make sure its file permissions match those generated for the default datadir (permissions are platform dependent).
Some other global searchbox parameters are stored in a global configuration file, called focuseek.cfg, that is usually placed in the config directory inside the program installation directory, that is c:\Program Files\focuseek\searchbox\config\focuseek.cfg on Windows, /opt/searchbox/lib/platform/config/searchbox.cfg on Linux and /Applications/searchbox/lib/platform/searchbox.cfg on Mac OS X. On Linux and Mac OS X its location can be changed by editing the searchbox_env file, located under /opt/searchbox/etc/ on Linux and under /Applications/searchbox/etc/searchbox.cfg on Mac OS X.
Configuration file parameters affect the whole searchbox platform and are only used on startup. You can change the parameters at any time but you'll have to restart searchbox to enable them.
The following sections document the configuration file options. Not all options are documented; the undocumented options are related to internal platform coordination or to experimental features and shouldn't be changed.
The User and Group configuration parameters determine the process identity. These are Unix-only options. searchbox needs to be run as root on startup, but for security purposes it will switch to the specified user and group after the initial startup phase.
On Windows you configure the process identity through the Service Control Manager.
DEPLOYDIR and
DATADIR point to the program installation
location and the data file location. You don't usually need to change
these, as they are set by the installation application.
searchbox records that it's running in a file called the pidfile
whose location is specified in the PIDFILE
parameter. It is set to /var/run/searchbox.pid and
you should not change it unless you are fiddling with system
startup/shutdown.
WORKERS is the number of processes that concurrently perform the gathering/parsing/rendering chain. The default value is 5, but it can be modified according to the size of your pool of sources and the available amount of RAM and CPU.
These directives let you control how to access searchbox from external applications:
HTTP_PORT - This parameter has the form address:port. The searchbox integrated web server will listen on the specified IP address and port. The specified IP address should be associated with a network interface card in the system. A special value of * can be used as the IP address to have searchbox listen on all network interface cards. The default port is 2200. For example 127.0.0.1:2200 opens searchbox only to applications running on the local computer, while *:2200 opens searchbox to requests from any network interface card.
HTTP_USER - This parameter holds the built-in user name. This user has administrator rights over all the searchbox objects.
HTTP_MD5PASS - This parameter holds the MD5 hash of the password for the user named in HTTP_USER.
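The values for HTTP_PORT and HTTP_MD5PASS can be prepared and checked with a short script. This is a minimal sketch with hypothetical function names. It assumes that HTTP_MD5PASS expects the lowercase hexadecimal MD5 digest of the UTF-8 encoded password with no salt; the exact encoding is not stated here, so verify it against a password set through the Control Panel.

```python
import hashlib
from typing import Tuple


def md5_password(password: str) -> str:
    """Hex MD5 digest of the password (assumed format for HTTP_MD5PASS)."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def split_http_port(value: str) -> Tuple[str, int]:
    """Split an address:port value such as '*:2200' into its two parts."""
    address, sep, port = value.rpartition(":")
    if not sep or not address or not port.isdigit():
        raise ValueError(f"expected address:port, got {value!r}")
    return address, int(port)
```

For example, split_http_port("127.0.0.1:2200") yields the loopback address and the default port, which restricts access to applications on the local computer.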
If you're running searchbox from behind a proxy, the following configuration directives let you set the necessary parameters:
HTTP_PROXY - Use this parameter to
specify the proxy server that searchbox should use to access the
"external" Internet. This parameter should have the form
host:port, for example
proxy:80.
HTTP_PROXYUSERPWD - If the proxy has access control enabled, specify the credentials with this option, in the form user:pass.
HTTP_NOPROXY - This parameter lets you distinguish local servers from the "external" Internet. Local servers will be accessed without going through the proxy. You should specify local servers as a list of hostnames separated by spaces, such as www.our.net our.server.net. The specified hostnames are treated as suffixes, so our.net would match www.our.net, ftp.our.net and all other third-level servers.
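The suffix rule for HTTP_NOPROXY can be illustrated as follows. This is a sketch of the matching logic only, with a hypothetical function name. It assumes a plain, case-insensitive string suffix comparison; whether searchbox additionally requires the match to start at a dot boundary is not stated here.

```python
def uses_proxy(hostname: str, http_noproxy: str) -> bool:
    """Return False when hostname ends with one of the space-separated suffixes."""
    host = hostname.lower()
    for suffix in http_noproxy.lower().split():
        if host.endswith(suffix):
            return False  # local server: bypass the proxy
    return True  # everything else goes through HTTP_PROXY
```

With HTTP_NOPROXY set to our.net our.server.net, requests to www.our.net and ftp.our.net bypass the proxy, while requests to any other host go through it.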
searchbox can send notifications by email or instant messages using external facilities:
SMTPServer - Outgoing mail server host:port
SMTPSS - yes if we should try SMTP/SSL
SMTPUser - SMTP authentication user
SMTPPasswd - SMTP authentication password
SMTPSecureAuth - If yes don't use SMTP insecure authentication methods
SMTPFrom - SMTP From address
SMTPReplyTo - SMTP Reply-to address
JabberServer - Jabber server host:port (port is optional and defaults to 5222)
JabberID - Jabber ID searchbox will use (name@domain or name@domain/resource); the default resource is searchbox
JabberPasswd - Jabber password for JabberID
JabberSASL - yes if the server uses SASL auth (most don't)
JabberNotificationResultsPerMessage - Number of notification results in each jabber message
JabberNotificationMessageDelay - Delay in seconds between two notification jabber messages
LOGCFG determines how much information will
be logged to the files inside the log directory in the searchbox data
directory. It is a space-separated list of facility:level pairs.
Facility names are:
| Facility | Description |
|---|---|
| API | API calls |
| FFF | FFF handling |
| HTTPSERVER | internal HTTP operations |
| IPCMARSHAL | ipc marshallers |
| IPCSOCK | ipc sockets |
| FETCHEDDOCS | fetched documents |
| KEY | Key checks |
| LOG | log system itself |
| PLUGINS | Messages from plugins |
| PLUGINSYSTEM | The plugin framework |
| WORKER | other worker activities |
| WORKERFETCHERS | fetchers |
| WORKERPARSERS | parsers |
| WORKERSCHEDULING | scheduling of workers |
Possible level values, in decreasing order of verbosity, are BORING, DEBUG, INFO, WARNING, ERROR, CRITICAL. The default for all facilities is INFO, and it's generally not necessary to change it, except possibly for debugging purposes.
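A LOGCFG value can be validated before it is written to the configuration file. The sketch below is an illustration with a hypothetical function name. It assumes that facility and level names are case-insensitive and that facilities not mentioned in the value keep the INFO default.

```python
from typing import Dict

FACILITIES = {
    "API", "FFF", "HTTPSERVER", "IPCMARSHAL", "IPCSOCK", "FETCHEDDOCS", "KEY",
    "LOG", "PLUGINS", "PLUGINSYSTEM", "WORKER", "WORKERFETCHERS",
    "WORKERPARSERS", "WORKERSCHEDULING",
}
# Decreasing order of verbosity, as listed above.
LEVELS = ["BORING", "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]


def parse_logcfg(value: str) -> Dict[str, str]:
    """Parse a space-separated list of facility:level pairs."""
    levels = {facility: "INFO" for facility in FACILITIES}
    for pair in value.split():
        facility, sep, level = pair.partition(":")
        facility, level = facility.upper(), level.upper()
        if not sep or facility not in FACILITIES or level not in LEVELS:
            raise ValueError(f"invalid facility:level pair: {pair!r}")
        levels[facility] = level
    return levels
```

For example, parse_logcfg("API:DEBUG WORKER:WARNING") raises the verbosity of API calls and lowers that of the worker while leaving every other facility at INFO.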
focuseek searchbox is a well-behaved web spider and implements the
standard for robots exclusion (http://www.robotstxt.org/wc/norobots.html),
usually referred to as the robots.txt standard. The
robots.txt standard lets webmasters disable spidering of portions of
their site by placing a suitably formatted
robots.txt file in the root directory of their
site. The robots.txt standard requires that each spider identifies
itself using a user agent string. This string is used also to disable
some robots spidering the site while still allowing others to access it.
See the standard for the details.
You can specify the user agent used by searchbox with the UserAgent parameter in the configuration file. It defaults to focuseekbot. This parameter can be overridden by the specific configuration for each Archive (see the Control Panel documentation).
The UserAgent parameter is also used to identify the spider in the HTTP User-Agent request header. focuseek searchbox uses the following HTTP user agent:
Mozilla/4.0 (compatible; agent)
Where agent is the value of the UserAgent parameter in the configuration file.
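A webmaster can check how a robots.txt file affects the spider before publishing it. The sketch below uses Python's standard urllib.robotparser module rather than searchbox's own parser, so it only approximates searchbox's behaviour. The robots.txt content, the example.com URLs and the function names are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep focuseekbot out of /private/, allow everyone else.
ROBOTS_TXT = """\
User-agent: focuseekbot
Disallow: /private/

User-agent: *
Disallow:
"""

_parser = RobotFileParser()
_parser.parse(ROBOTS_TXT.splitlines())


def http_user_agent(agent: str) -> str:
    """Build the HTTP User-Agent header value from the UserAgent parameter."""
    return f"Mozilla/4.0 (compatible; {agent})"


def allowed(agent: str, url: str) -> bool:
    """True if the robots.txt above lets the given user agent fetch the URL."""
    return _parser.can_fetch(agent, url)
```

Here allowed("focuseekbot", "http://www.example.com/private/report.html") is False, while the same URL is allowed for any other robot name.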
For further details on the HTTP user agent see for example http://www.mozilla.org/build/user-agent-strings.html
Index optimization occurs daily. Since it is a very disk- and CPU-intensive operation, you can choose the exact hour at which it will occur. The OptimizeAt parameter holds the GMT hour at which index optimization will occur. You can disable index optimization by setting the OptimizeAt parameter to -1.
When the optimization is enabled some other parameters can be tuned:
MERGEFACTOR - A number that controls how often the indexer merges index segments when adding documents. Merging takes time, so a low merge factor (resulting in many merges for a given number of documents added) will slow down indexing. However, a low number keeps the number of index segments low, which speeds up searching and causes the indexer to use fewer file handles.
MINMERGEDOCS - (default value is 10) Determines how many documents are added before they are assembled into a segment, and therefore how often segments are created.
MAXMERGEDOCS - The maximum number of documents that can be stored in an index segment. When this value is reached in a segment it will not grow any larger. As a result there may be more segments than expected from looking at the merge factor.
MAXCLAUSECOUNT - The maximum number of terms allowed in AND/OR expressions in queries.
MAXRAM - The maximum amount of RAM that the indexing service process can use.
MAXFIELDLENGTH - The maximum number of
tokens used to generate the index entry for a document. So for full
text indexing, only this number of tokens will be considered.
MAXWORDLENGTH - The maximum acceptable
length for tokens used to generate the index entry for a document. So
for full text indexing, long words will be truncated to this
value.
For performance reasons, documents are not added to the index one at a time, but in batches. However, an index flush will be forced from time to time. The FlushLatency parameter holds the number of seconds between index flushes.
The SyncLevel parameter determines how often the internal database flushes its data to disk.
| SyncLevel value | Description |
|---|---|
| 0 | No sync, DB will be corrupted after a power failure |
| 1 | Normal sync (default), very low probability of corruption after power failure |
| 2 | Full sync, maximum safety but slower performance |
searchbox has its own internal HTTP server used to implement the Web Service interface. The following parameters determine its behaviour:
HTTP_PORT - The HTTP port number used by searchbox to expose itself as a Web Service (configurable from the Control Panel)
HTTP_USER - The administrator username
HTTP_MD5PASS - The MD5-hashed administrator password (configurable from the Control Panel)
HTTP_THREADS - Number of available threads
HTTP_THREADS has a large impact on the memory footprint, because a query process is created for each thread.
HTTP_URL_PREFIX - The public address for the HTTP portal server. Defaults to http://<hostname>:<HTTP_PORT>/
If you want to disable access to searchbox through the internal Enterprise Search Portal, set the PORTAL_ENABLED parameter to NO.
The worker is the searchbox component responsible for all processing activities on every single document. Sometimes a worker process hangs because it is unable to process a document. The WORKER_TIMEOUT parameter sets the number of seconds that searchbox waits before killing a hung process.
Gathering a large document set can lead to disk space exhaustion.
Searchbox will check for free disk space and halt the gathering process
when disk free space falls below the value set in the
MINADDFREEDISK parameter (in MB).
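The same kind of check can be reproduced from a monitoring script to anticipate a lowdiskspace condition. This is a sketch under stated assumptions: the function names are hypothetical, one MB is taken to be 1024 * 1024 bytes, and the comparison is "gathering continues while free space is at least MINADDFREEDISK", which may differ in detail from the engine's own check.

```python
import shutil

BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def free_disk_mb(path: str) -> int:
    """Free space, in MB, on the volume that contains path (e.g. the datadir)."""
    return shutil.disk_usage(path).free // BYTES_PER_MB


def gathering_allowed(free_mb: int, min_add_free_disk_mb: int) -> bool:
    """Gathering halts when free space falls below the configured threshold."""
    return free_mb >= min_add_free_disk_mb
```

A script would call gathering_allowed(free_disk_mb(datadir), threshold) with the datadir path and the MINADDFREEDISK value taken from the configuration file.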
You can allow the gathering process to run only during a certain
period of the day. For example, you could allow gathering only during
the night. To do this, you can change the
CRAWLSTART and CRAWLSTOP
parameters, which hold the start and stop hours (GMT). You can
completely stop the gathering by setting one of these parameters to
-1.
Note that crawls are put in a sleep state outside this time interval and resume from the point where they left off. This is different from stopping and restarting a crawl, in which case gathering starts again from the seeds.
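The effect of CRAWLSTART and CRAWLSTOP can be sketched as a simple hour check. This is an illustration, not the engine's code: the function name is hypothetical, and it assumes that the start hour is inclusive, the stop hour is exclusive, a window with start greater than stop wraps past midnight, and equal values give an empty window.

```python
from datetime import datetime, timezone


def crawl_allowed(hour: int, crawl_start: int, crawl_stop: int) -> bool:
    """True if gathering may run at the given GMT hour (0-23)."""
    if crawl_start == -1 or crawl_stop == -1:
        return False  # gathering is completely stopped
    if crawl_start <= crawl_stop:
        return crawl_start <= hour < crawl_stop
    # The window wraps around midnight, e.g. 22 -> 6.
    return hour >= crawl_start or hour < crawl_stop


def crawl_allowed_now(crawl_start: int, crawl_stop: int) -> bool:
    """Same check for the current GMT hour."""
    return crawl_allowed(datetime.now(timezone.utc).hour, crawl_start, crawl_stop)
```

With CRAWLSTART set to 22 and CRAWLSTOP set to 6, gathering is allowed at 23:00 and 03:00 GMT but sleeps at noon.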
searchbox can process suitably formatted metadata placed in HTML files. To enable this feature you must set the USE_SEARCHBOX_METAS parameter to 1. The default value is 0. See the section called “Collecting metadata embedded in HTML documents” for further explanations.