Chapter 3. Integrating searchbox

Table of Contents

SOAP API
Complex data types
enum AccessType
struct AccessSpec
enum RootACLType
enum PluginType
enum ExtendedPluginType
struct PluginDLLInfo
struct ExtendedPluginValue
struct ExtendedPlugin
struct ExtendedPluginInfo
struct ExtendedPluginsBundle
struct Plugin
struct PluginInfo
struct DocFilter
struct DocFilterInfo
struct RendererPluginPipe
enum AuthType
enum CookieParamType
struct CookieParam
enum CookieActionType
struct CookieAuthParams
struct MetadataKey
struct MetadataValue
struct SourceConfiguration
struct ParamValue
struct AuthConfiguration
enum Frequency
enum Cache
struct ArchiveConfiguration
struct CollectionConfiguration
enum QueryView
enum QueryParser
enum WatchNotificationMedia
struct WatchNotificationEndpoint
struct WatchNotificationTiming
enum QueryInfo
struct QuerySliceWeight
enum NotificationDetail
struct WatchNotificationConfiguration
struct WatchConfiguration
struct MetadataTemplateConfiguration
enum QuerySort
enum QueryAtomType
struct QueryAtom
struct QuerySpec
struct QueryResult
struct SourceInfo
struct ArchiveInfo
struct CollectionInfo
struct WatchInfo
struct MetadataTemplateInfo
enum PageFormat
enum Status
struct ExtendedStatus
struct CrawlLog
enum CrawlError
struct CrawlLogData
Methods
AddSource
RemoveSource
GetSourceConfiguration
SetSourceConfiguration
EnumSourceConfigurations
AddAuth
RemoveAuth
GetAuthConfiguration
SetAuthConfiguration
AddArchive
RemoveArchive
GetArchiveConfiguration
SetArchiveConfiguration
EnumArchiveConfigurations
GetArchiveInfo
AddCollection
RemoveCollection
GetCollectionConfiguration
SetCollectionConfiguration
EnumCollectionConfigurations
AddWatch
RemoveWatch
GetWatchConfiguration
SetWatchConfiguration
EnumWatchConfigurations
NotifyWatches
AddMetadataTemplate
RemoveMetadataTemplate
GetMetadataTemplateConfiguration
SetMetadataTemplateConfiguration
EnumMetadataTemplateConfigurations
Query
GetDocument
GetDocumentURL
GetDocumentIDs
AddDocument
RemoveDocument
GetKeepTime
SetKeepTime
ApplyMetadataTemplate
DeapplyMetadataTemplate
EnumAppliedMetadataTemplates
GetStatus
GetVersion
NormalizeURL
StartCrawl
StopCrawl
ReprocessDocuments
GetAvailableCrawlLogs
GetCrawlLog
AddUser
RemoveUser
GetUserData
SetUserData
SetUserPassword
AddGroup
RemoveGroup
EnumUsersAndGroups
GetRootACL
SetRootACL
EnumPluginDLLs
EnumExtendedPlugins
AddExtendedPlugin
RemoveExtendedPlugin
GetExtendedPlugin
SetExtendedPlugin
GetRootExtendedPluginsBundle
SetRootExtendedPluginsBundle
EnumPlugins
AddPlugin
RemovePlugin
GetPlugin
SetPlugin
EnumDocFilters
AddDocFilter
RemoveDocFilter
GetDocFilter
SetDocFilter
GetRootRendererPipe
SetRootRendererPipe
The SOAP notification service requirements
Complex data types
struct MetadataKey
struct MetadataValue
struct QueryResult
Methods
WatchNotify

This chapter describes how searchbox can be integrated into other applications. The searchbox Engine exposes its functionality as a standard Web Service, so it can be used through standard SOAP remote calls.

SOAP API

This section gives a complete description of searchbox's SOAP interface, which is defined by a standard WSDL.

The searchbox WSDL definition and SOAP interface are fully interoperable with the Microsoft .NET platform, Apache Software Foundation's Axis (Java) and numerous other SOAP implementations.

Complex data types defined by the interface and all methods will be described in detail.

Assuming you are running searchbox on the local computer on the default port, you can access the WSDL through a URL of the form:

http://user:password@localhost:2200/wsdl

and the SOAP endpoint (included in the WSDL) will have the form:

http://user:password@localhost:2200/soap
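
As a minimal sketch, the two URLs above can be assembled from their parts as follows. The helper name and its parameters are illustrative and not part of searchbox; only port 2200 and the /wsdl and /soap paths are taken from the text above.

```python
from urllib.parse import quote


def searchbox_urls(user, password, host="localhost", port=2200):
    """Return the (WSDL URL, SOAP endpoint URL) pair for a searchbox server.

    Hypothetical helper: only the port and the two paths come from the
    documentation; the rest is an assumption made for illustration.
    """
    # Percent-encode the credentials so that characters such as '@' or ':'
    # cannot be confused with the URL delimiters.
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    base = f"http://{credentials}@{host}:{port}"
    return f"{base}/wsdl", f"{base}/soap"


wsdl_url, soap_url = searchbox_urls("user", "password")
print(wsdl_url)
print(soap_url)
```

Most SOAP toolkits only need the WSDL URL, since the endpoint is included in the WSDL itself.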

Complex data types

Methods of searchbox SOAP interface use the following complex types:

enum AccessType

enum AccessType
{
  READACCESS,
  WRITEACCESS
}

Used to specify the ACL entry access type, it can have the following values:

  • READACCESS - Read access.

  • WRITEACCESS - Write access.

struct AccessSpec

struct AccessSpec
{
  string Name;
  AccessType Access;
  boolean Deny;
}

Used to specify an ACL entry, it has the following fields:

  • string Name - The name of the user or group this entry refers to.

  • AccessType Access - The type of access you want to grant or deny.

  • boolean Deny - TRUE if you want to deny the specified access, FALSE to grant it.
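
The two types above could be mirrored on the client side roughly as follows. This is only a sketch: how a given SOAP toolkit maps the WSDL types to Python objects depends on the toolkit, and the user and group names are made up for the example.

```python
from dataclasses import dataclass
from enum import Enum


class AccessType(Enum):
    READACCESS = "READACCESS"
    WRITEACCESS = "WRITEACCESS"


@dataclass
class AccessSpec:
    Name: str            # user or group the entry refers to
    Access: AccessType   # access being granted or denied
    Deny: bool = False   # True denies the access, False grants it


# Hypothetical ACL: grant write access to a group, deny it to one user.
acl = [
    AccessSpec("editors", AccessType.WRITEACCESS),
    AccessSpec("guest", AccessType.WRITEACCESS, Deny=True),
]

for entry in acl:
    verb = "deny" if entry.Deny else "grant"
    print(f"{verb} {entry.Access.name} to {entry.Name}")
```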

enum RootACLType

enum RootACLType
{
  ROOTACL_BROWSING,
  ROOTACL_CRAWLING,
  ROOTACL_PROCESSING
}

Used to specify the root ACL type, it can have the following values:

  • ROOTACL_BROWSING - Specifies the Browsing root ACL. This ACL controls the rights to enumerate users and groups (READACCESS), and create new collections and watches (WRITEACCESS).

  • ROOTACL_CRAWLING - Specifies the Crawling root ACL. This ACL controls the rights to create new sources and archives (WRITEACCESS).

  • ROOTACL_PROCESSING - Specifies the Processing root ACL. This ACL controls the rights to create new metadata templates (WRITEACCESS).

enum PluginType

enum PluginType
{
  UNKNOWN_PLUGINTYPE,
  ARCHIVEFFF_PLUGINTYPE,
  EXTENDED_PLUGINTYPE
}

Describes the type of a plugin or plugin DLL. It can have the following values:

  • UNKNOWN_PLUGINTYPE - Unknown type; this is usually due to a buggy plugin.

  • ARCHIVEFFF_PLUGINTYPE - Undocumented, for backwards compatibility only.

  • EXTENDED_PLUGINTYPE - A focuseek plugin.

enum ExtendedPluginType

enum ExtendedPluginType
{
  UNKNOWN_EXTENDED_PLUGINTYPE,
  PROTOCOL_EXTENDED_PLUGINTYPE,
  PARSER_EXTENDED_PLUGINTYPE,
  RENDERING_EXTENDED_PLUGINTYPE
}

Describes the type of an extended plugin. It can have the following values:

  • UNKNOWN_EXTENDED_PLUGINTYPE - Unknown type; this is usually due to a buggy plugin.

  • PROTOCOL_EXTENDED_PLUGINTYPE - A protocol plugin.

  • PARSER_EXTENDED_PLUGINTYPE - A parser plugin.

  • RENDERING_EXTENDED_PLUGINTYPE - A rendering plugin.

struct PluginDLLInfo

struct PluginDLLInfo
{
  PluginType Type;
  string Name;
  string Description;
  string Producer;
}

Information about a plugin DLL.

  • PluginType Type - The type of the plugin dll.

  • string Name - The name of the plugin dll.

  • string Description - The description of the plugin dll.

  • string Producer - The producer of the plugin dll.

struct ExtendedPluginValue

struct ExtendedPluginValue
{
  string Key;
  string Value;
}

The name and value of an extended plugin parameter.

  • string Key - The parameter name.

  • string Value - The parameter value.

struct ExtendedPlugin

struct ExtendedPlugin
{
  string Name;
  string Description;
  ExtendedPluginType ExtendedType;
  string DLLName;
  integer ParentID;
  boolean FullyConfigured;
  string[] UnsetConfigurationKeys;
  ExtendedPluginValue[] SetConfigurationValues;
  AccessSpec[] ACL;
}

The configuration describing an extended plugin.

  • string Name - The plugin name. It is meant to be human-readable.

  • string Description - The plugin description. It is meant to be human-readable.

  • ExtendedPluginType ExtendedType - The plugin extended type. This field is filled by the plugin and is ignored when passed by the user.

  • string DLLName - The name of the plugin dynamic link library.

  • integer ParentID - The id of the parent plugin this plugin inherits its configuration from. If this is 0 then the plugin has no parent. Note that DLLName is ignored when ParentID is not 0.

  • boolean FullyConfigured - If true the plugin configuration is complete and the plugin can be used. This field is filled by the plugin and is ignored when passed by the user.

  • string[] UnsetConfigurationKeys - The names of the parameters the plugin requires but that are not set. This field is filled by the plugin and is ignored when passed by the user.

  • ExtendedPluginValue[] SetConfigurationValues - The set parameters and their values.

  • AccessSpec[] ACL - Access control list for this plugin.

struct ExtendedPluginInfo

struct ExtendedPluginInfo
{
  integer ExtendedPluginID;
  ExtendedPlugin ExtendedPluginConfiguration;
  integer Magic;
  boolean ReadOnly;
}

Information about an extended plugin.

  • integer ExtendedPluginID - The plugin id.

  • ExtendedPlugin ExtendedPluginConfiguration - The plugin configuration.

  • integer Magic - Magic number to use in SetExtendedPlugin.

  • boolean ReadOnly - If true, the ACL grants the user read access to the plugin but not write access.

struct ExtendedPluginsBundle

struct ExtendedPluginsBundle
{
  ExtendedPluginType ExtendedType;
  integer[] ExtendedPluginIDs;
}

Information about a bundle of extended plugins.

  • ExtendedPluginType ExtendedType - The type of the plugins in this bundle.

  • integer[] ExtendedPluginIDs - The IDs of the plugins making up the bundle.

struct Plugin

Undocumented, for backwards compatibility only.

struct PluginInfo

Undocumented, for backwards compatibility only.

struct DocFilter

Undocumented, for backwards compatibility only.

struct DocFilterInfo

Undocumented, for backwards compatibility only.

struct RendererPluginPipe

Undocumented, for backwards compatibility only.

enum AuthType

enum AuthType
{
  AUTH_NONE,
  AUTH_BASIC,
  AUTH_COOKIE,
  AUTH_SSLCERT
}

Used to specify the authentication type, it can have the following values:

  • AUTH_NONE - No authentication.

  • AUTH_BASIC - Use plain username/password authentication.

  • AUTH_COOKIE - Use cookie authentication (only for HTTP/HTTPS).

  • AUTH_SSLCERT - Use an SSL certificate for authentication.

enum CookieParamType

enum CookieParamType
{
  USERNAME,
  PASSWORD,
  OTHER
}

Used to specify the cookie request parameter type, it can have the following values:

  • USERNAME - The parameter is a username.

  • PASSWORD - The parameter is a password.

  • OTHER - None of the above.

struct CookieParam

struct CookieParam
{
  string Name;
  string Value;
  CookieParamType Type;
}

Used to specify a cookie request parameter, it has the following fields:

  • string Name - Name of the parameter.

  • string Value - Value of the parameter.

  • CookieParamType Type - Type of the parameter.

enum CookieActionType

enum CookieActionType
{
  GET,
  POST
}

Used to specify the cookie request action type, it can have the following values:

  • GET - The request action will be an HTTP GET.

  • POST - The request action will be an HTTP POST.

struct CookieAuthParams

struct CookieAuthParams
{
  string Url;
  CookieActionType Action;
  integer Freshness;
  CookieParam[] Params;
  string[] PreUrls;
}

Used to specify the cookie authentication parameters, it has the following fields:

  • string Url - Cookie request URL.

  • CookieActionType Action - Cookie request action.

  • integer Freshness - Cookie freshness, in seconds.

  • CookieParam[] Params - Cookie request parameters.

  • string[] PreUrls - Cookie request pre URLs. This sequence of URLs will be visited before accessing the cookie request URL, and cookies returned by the server will be accumulated and used (if applicable) in the cookie request.

struct MetadataKey

struct MetadataKey
{
  string Key;
  integer Slice;
}

Used to specify a metadata key, it has the following fields:

  • string Key - Metadata key.

  • integer Slice - Slice where metadata is placed.

struct MetadataValue

struct MetadataValue
{
  MetadataKey Key;
  string Value;
}

Used to specify a metadata value, it has the following fields:

  • MetadataKey Key - Metadata key.

  • string Value - Metadata value.

struct SourceConfiguration

struct SourceConfiguration
{
  string Name;
  string Description;
  string[] Seeds;
  string[] IncludeFilters;
  string[] ExcludeFilters;
  integer DepthLimit;
  boolean CheckRobotsTxt;
  boolean HasSideMetadata;
  AuthType Type;
  CookieAuthParams CookieParams;
  integer[] RendererDocFilterIDs;
  ExtendedPluginsBundle[] ExtendedPluginsBundles;
  string[] TextExclusionBeginMarker;
  string[] TextExclusionEndMarker;
  AccessSpec[] ACL;
}

Used to define parameters of a source, it has the following fields:

  • string Name - Source name, used by client.

  • string Description - Source description, used by client.

  • string[] Seeds - List of seed URLs for crawling.

  • string[] IncludeFilters - Regular expressions that must be matched for a URL to be accepted by this source (including seeds).

  • string[] ExcludeFilters - Regular expressions that must not be matched for a URL to be accepted by this source (including seeds).

  • integer DepthLimit - Crawl depth limit. "0" means no depth limit. The seed is counted as 1 level, and every hyperlink followed starting from the seed URLs is counted as one additional level, so this value minus 1 is the maximum number of hyperlinks that can be followed starting from the seeds. All documents further away from the seeds will not be crawled.

  • boolean CheckRobotsTxt - Set to TRUE to obey robots.txt rules during the crawl, set to FALSE to crawl URLs regardless of the exclusions set by the web master (only for HTTP/HTTPS). Beware that ignoring robots.txt rules will probably lead the web site administrator to permanently ban your IP address!

  • boolean HasSideMetadata - Set to TRUE if you want to fetch and parse side-by-side metadata files.

  • AuthType Type - Type of authentication to use for crawling.

  • CookieAuthParams CookieParams - Cookie parameters, needed when using Cookie authentication (only for HTTP).

  • integer[] RendererDocFilterIDs - This field is used for backwards compatibility only.

  • ExtendedPluginsBundle[] ExtendedPluginsBundles - The extended plugin bundles to use for this source. Note that at most one bundle per extended plugin type is allowed for each source.

  • string[] TextExclusionBeginMarker - A regular expression identifying the start of document[1] sections searchbox should completely ignore. Any text or formatting elements between a TextExclusionBeginMarker and the next TextExclusionEndMarker are ignored. The regular expression is matched against the source of the document itself.

  • string[] TextExclusionEndMarker - A regular expression identifying the end of document sections searchbox should ignore. See TextExclusionBeginMarker above for more details.

  • AccessSpec[] ACL - Access control list for this configuration.
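
The DepthLimit rule described above can be sketched as follows. The function names are hypothetical, and rejecting negative values is an assumption, since the text does not say how searchbox treats them.

```python
def max_links_from_seed(depth_limit):
    """Return the maximum number of hyperlinks that may be followed from a
    seed URL, or None when DepthLimit is 0 (no depth limit)."""
    if depth_limit < 0:
        # Assumption: negative limits are not meaningful.
        raise ValueError("DepthLimit must not be negative")
    if depth_limit == 0:
        return None
    # The seed itself counts as one level, so one fewer link can be followed.
    return depth_limit - 1


def within_depth(links_followed, depth_limit):
    """True if a document reached after `links_followed` hyperlinks from a
    seed is still within the crawl depth."""
    limit = max_links_from_seed(depth_limit)
    return limit is None or links_followed <= limit


print(max_links_from_seed(3))
print(within_depth(3, 3))
```

With DepthLimit set to 1 only the seeds themselves are crawled, because no hyperlink may be followed.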

struct ParamValue

struct ParamValue
{
  string Name;
  string Value;
}

Used to specify the value you want to assign to a cookie parameter, it has the following fields:

  • string Name - The name of the parameter. You can specify only parameters of type OTHER.

  • string Value - The value of the parameter.

struct AuthConfiguration

struct AuthConfiguration
{
  string Username;
  string Password;
  ParamValue[] ParamValues;
  string SSLCertificate;
  AccessSpec[] ACL;
}

Used to specify an authentication configuration, it has the following fields:

  • string Username - Username (for Basic and Cookie authentication).

  • string Password - Password (for Basic and Cookie authentication).

  • ParamValue[] ParamValues - List of cookie parameter values (only for Cookie authentication).

  • string SSLCertificate - The SSL certificate to use, in Privacy-Enhanced Electronic Mail (PEM) format[2].

  • AccessSpec[] ACL - Access control list for this configuration.

enum Frequency

enum Frequency
{
  DISABLED,
  HOURLY,
  DAILY,
  WEEKLY,
  MONTHLY
}

Used to set the base refresh frequency of an archive, it can have the following values:

  • DISABLED - Refresh disabled

  • HOURLY - Hourly refresh (every N hours)

  • DAILY - Daily refresh (every N days)

  • WEEKLY - Weekly refresh (every given day of week)

  • MONTHLY - Monthly refresh (every given day of month)

enum Cache

enum Cache
{
  FULLCACHE,
  CONTEXTCACHE,
  NOCACHE
}

Used to set the level of document caching, it can have the following values:

  • FULLCACHE - Retain full document cache

  • CONTEXTCACHE - Retain minimal cache needed for context extraction

  • NOCACHE - Don't retain any cache

struct ArchiveConfiguration

struct ArchiveConfiguration
{
  string Name;
  boolean Historicize;
  string AdministrativeContact;
  integer Source;
  integer Auth;
  Frequency SchedulingFrequency;
  integer SchedulingAttribute;
  integer AccessTime;
  integer PagesLimit;
  integer GarbageLimit;
  Cache DocumentCache;
  integer Throttling;
  AccessSpec[] ACL;
}

Used to define parameters of an archive, it has the following fields:

  • string Name - Archive name, used by client.

  • boolean Historicize - Enable historicization. The value that is initially set with the AddArchive call is retained throughout all the lifecycle of the archive, i.e. you cannot change its value with the SetArchiveConfiguration call.

  • string AdministrativeContact - Email address of administrative contact.

  • integer Source - ID of the source configuration used to crawl documents for this archive.

  • integer Auth - ID of the authentication configuration used to crawl documents for this archive when an authentication type other than AUTH_NONE is specified in the source configuration. Use 0 if AUTH_NONE is used in the source configuration.

  • Frequency SchedulingFrequency - Base archive automatic refresh frequency.

  • integer SchedulingAttribute - Base refresh frequency multiplication factor. If the base frequency is HOURLY or DAILY, the factor is the interval in hours or days between two crawls. If the base frequency is WEEKLY, this is the day of the week the crawl is intended to start on (0=Sunday, 1=Monday, etc.). If the base frequency is MONTHLY, this is the day of the month the crawl is intended to start on.

  • integer AccessTime - Start of crawl. If the base frequency is HOURLY it can have a value between 0 and SchedulingAttribute and sets the time of the first daily crawl, expressed as an offset from 00:00 GMT. If the base frequency is DAILY, WEEKLY or MONTHLY it can have a value between 0 and 23 and sets the GMT time of crawl.

  • integer PagesLimit - Maximum number of documents to fetch during a crawl session. "0" means no limit. When this limit is reached during a crawl, the crawling session is terminated.

  • integer GarbageLimit - Minimum age of a document before garbage collection. "0" means no garbage collection. The age is expressed in seconds and is measured from the last time the document was fetched.

  • Cache DocumentCache - Caching level to be used for this archive.

  • integer Throttling - The minimum interval (in seconds) searchbox should wait after fetching a document for the archive before attempting to fetch another.

  • AccessSpec[] ACL - Access control list for this configuration.
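
To illustrate how SchedulingFrequency, SchedulingAttribute and AccessTime combine, the sketch below turns the three values into a readable description. The function is hypothetical, the frequency is passed as its enum name, and the HOURLY offset is assumed to be expressed in hours.

```python
WEEKDAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
            "Thursday", "Friday", "Saturday"]


def describe_schedule(frequency, attribute, access_time):
    """Describe a refresh schedule in words (all times are GMT)."""
    if frequency == "DISABLED":
        return "refresh disabled"
    if frequency == "HOURLY":
        # AccessTime is an offset from 00:00 GMT (assumed to be in hours).
        return (f"every {attribute} hour(s), "
                f"first crawl of the day at {access_time:02d}:00 GMT")
    if frequency == "DAILY":
        return f"every {attribute} day(s) at {access_time:02d}:00 GMT"
    if frequency == "WEEKLY":
        # 0=Sunday, 1=Monday, etc.
        return f"every {WEEKDAYS[attribute]} at {access_time:02d}:00 GMT"
    if frequency == "MONTHLY":
        return f"on day {attribute} of every month at {access_time:02d}:00 GMT"
    raise ValueError(f"unknown frequency: {frequency!r}")


print(describe_schedule("WEEKLY", 1, 3))
print(describe_schedule("HOURLY", 6, 2))
```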

struct CollectionConfiguration

struct CollectionConfiguration
{
  string Name;
  string Description;
  integer[] Archives;
  AccessSpec[] ACL;
}

Used to define parameters of a collection, it has the following fields:

  • string Name - Collection name, used by client.

  • string Description - Collection description, used by client.

  • integer[] Archives - IDs of the archives that build up this collection.

  • AccessSpec[] ACL - Access control list for this configuration.

enum QueryView

enum QueryView
{
  VIEW_PUBLISHED,
  VIEW_CORECHANGED
}

Used to restrict the set of documents returned as result of a query:

  • VIEW_PUBLISHED - The query is applied to all the documents currently in the archive.

  • VIEW_CORECHANGED - The query is applied only to the documents currently in the archive that have changed in the core of the text. Only applicable to a historicizing archive.

enum QueryParser

enum QueryParser
{
  NOPARSER,
  RPNPARSER,
  ALGPARSER,
  NETPARSER
}

Used to specify the query string parser to use:

  • NOPARSER - Don't parse, the query is submitted using QueryAtoms.

  • RPNPARSER - Use the RPN parser.

  • ALGPARSER - Use the ALG parser.

  • NETPARSER - Use the NET parser.

enum WatchNotificationMedia

enum WatchNotificationMedia
{
  MAIL_WATCH_NOTIFICATION_MEDIA,
  JABBER_WATCH_NOTIFICATION_MEDIA,
  SOAP_WATCH_NOTIFICATION_MEDIA
}

The protocol (media) to use when sending a notification:

  • MAIL_WATCH_NOTIFICATION_MEDIA - The notification will be sent using e-mail (SMTP).

  • JABBER_WATCH_NOTIFICATION_MEDIA - The notification will be sent using the Jabber instant message protocol.

  • SOAP_WATCH_NOTIFICATION_MEDIA - The notification will be sent by calling a remote web service with the SOAP protocol. See the section called “The SOAP notification service requirements” for details on the called service.

struct WatchNotificationEndpoint

struct WatchNotificationEndpoint
{
  WatchNotificationMedia Media;
  string Address;
}

The destination to send a notification to:

  • WatchNotificationMedia Media - The protocol to use when sending the notification.

  • string Address - The address to send the notification to. The format of the address is media dependent; see Table 3.1, “Notification address formats”.

Table 3.1. Notification address formats

Media | Description
MAIL_WATCH_NOTIFICATION_MEDIA | An RFC 822 e-mail address, for example support@focuseek.com.
JABBER_WATCH_NOTIFICATION_MEDIA | A Jabber ID in the form user@domain/resource, for example somebody@jabber.org/client. The /resource part is usually omitted. See http://www.jabber.org/ for more details.
SOAP_WATCH_NOTIFICATION_MEDIA | The SOAP endpoint to call: an http or https URL, for example http://localhost:8080/MyService.
ACTION_WATCH_NOTIFICATION_MEDIA | In this case the address specifies the action searchbox will perform when the notification is triggered. The address is a sequence of tokens separated by a single ASCII space. The first token is always present and specifies the action to perform. The available actions are:

Action | Description | First argument | Other arguments
applytemplate | Add a template to all the documents in the notification | The numeric ID of the template to apply | Optional metadata assignments for the template variable metadata. Each token is in the form key:val=sliceid, where key is the metadata key for one of the template variable metadata, val is the value to assign and sliceid is the numeric ID of the slice the metadata is assigned to [a]
deapplytemplate | Remove a template from all the documents in the notification | The numeric ID of the template to remove | None
remove | Remove from searchbox all the documents in the notification | None | None
keepuntil | Force the searchbox garbage collector to ignore all the documents in the notification for a specified time | The time until which searchbox will keep the documents, in seconds since the Unix epoch (January 1st, 1970). See also the section called “SetKeepTime”. | None

[a] For a list of valid slice ids see the section called “struct QuerySliceWeight”
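
The address format for ACTION_WATCH_NOTIFICATION_MEDIA can be assembled as in the following sketch. The helper is hypothetical, and so are the template ID and the metadata key in the usage lines. Because the text does not describe any escaping for spaces, the sketch simply rejects tokens that contain one.

```python
def action_address(action, first_arg=None, assignments=()):
    """Build an action address: tokens separated by a single ASCII space.

    `assignments` is an iterable of (key, value, slice_id) tuples, each
    emitted as a key:val=sliceid token (meaningful for applytemplate only).
    """
    tokens = [action]
    if first_arg is not None:
        tokens.append(str(first_arg))
    for key, value, slice_id in assignments:
        tokens.append(f"{key}:{value}={slice_id}")
    for token in tokens:
        # Assumption: no escaping mechanism exists, so spaces are refused.
        if not token or " " in token:
            raise ValueError(f"token {token!r} must be non-empty and contain no spaces")
    return " ".join(tokens)


# Hypothetical template ID 7 with one variable metadata placed in slice 2.
print(action_address("applytemplate", 7, [("topic", "news", 2)]))
print(action_address("keepuntil", 1700000000))
print(action_address("remove"))
```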

struct WatchNotificationTiming

struct WatchNotificationTiming
{
  Frequency SchedulingFrequency;
  integer SchedulingAttribute;
  integer AccessTime;
}

Defines how often the notification will be sent. Note that a notification is never sent if it contains no documents. The structure has the following fields:

  • Frequency SchedulingFrequency - Base notification frequency.

  • integer SchedulingAttribute - Base notification frequency multiplication factor. If the base frequency is HOURLY or DAILY, the factor is the interval in hours or days between two notifications. If the base frequency is WEEKLY, this is the day of the week the notification will be sent on (0=Sunday, 1=Monday, etc.). If the base frequency is MONTHLY, this is the day of the month the notification is intended to be sent on.

  • integer AccessTime - Time of notification. If the base frequency is HOURLY it can have a value between 0 and SchedulingAttribute and sets the time of the first daily notification, expressed as an offset from 00:00 GMT. If the base frequency is DAILY, WEEKLY or MONTHLY it can have a value between 0 and 23 and sets the GMT time of the notification.

enum QueryInfo

enum QueryInfo
{
  INFO_NONE,
  INFO_URL,
  INFO_TITLE,
  INFO_CONTEXT,
  INFO_TEMPLATE_METADATA,
  INFO_ALL_METADATA
}

Used to specify the information returned as result of a query:

  • INFO_NONE - For each result no additional info is returned.

  • INFO_URL - For each result the URL is returned.

  • INFO_TITLE - For each result the URL and the title are returned.

  • INFO_CONTEXT - For each result the URL, the title, the mime type and the contexts where the keywords specified in the query have been found are returned.

  • INFO_TEMPLATE_METADATA - The same as INFO_CONTEXT but also returns the metadata added by templates.

  • INFO_ALL_METADATA - The same as INFO_TEMPLATE_METADATA but returns all the metadata.

struct QuerySliceWeight

struct QuerySliceWeight
{
  integer Slice;
  integer Weight;
}

Used to specify slice weights, it has the following fields:

  • integer Slice - Dict ID. The following dict IDs can be used:

    • 1 - Author

    • 2 - Keyword

    • 3 - Abstract

    • 4 - Invisible

    • 5 - Marginal normal text

    • 6 - Marginal emphasized text

    • 7 - Marginal link text

    • 8 - Marginal remote link text

    • 9 - Marginal header text

    • 10 - Central normal text

    • 11 - Central emphasized text

    • 12 - Central link text

    • 13 - Central remote link text

    • 14 - Central header text

    • 15 - Title

  • integer Weight - Slice weight. Must be greater than or equal to 0.
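
The slice IDs listed above lend themselves to a named enumeration on the client side. The sketch below is one possible arrangement: the symbolic names and the helper are not part of searchbox, and the weights in the usage line are arbitrary.

```python
from enum import IntEnum


class Slice(IntEnum):
    """Symbolic names (hypothetical) for the documented slice IDs."""
    AUTHOR = 1
    KEYWORD = 2
    ABSTRACT = 3
    INVISIBLE = 4
    MARGINAL_NORMAL = 5
    MARGINAL_EMPHASIZED = 6
    MARGINAL_LINK = 7
    MARGINAL_REMOTE_LINK = 8
    MARGINAL_HEADER = 9
    CENTRAL_NORMAL = 10
    CENTRAL_EMPHASIZED = 11
    CENTRAL_LINK = 12
    CENTRAL_REMOTE_LINK = 13
    CENTRAL_HEADER = 14
    TITLE = 15


def slice_weights(weights):
    """Turn a {Slice: weight} mapping into QuerySliceWeight-shaped dicts."""
    result = []
    for slice_id, weight in weights.items():
        if weight < 0:
            raise ValueError("Weight must be greater than or equal to 0")
        result.append({"Slice": int(slice_id), "Weight": weight})
    return result


# Arbitrary example weights: favour titles, ignore the author slice.
weights = slice_weights({Slice.TITLE: 10, Slice.AUTHOR: 0})
print(weights)
```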

enum NotificationDetail

enum NotificationDetail
{
  NOTIFICATION_DETAIL_NONE,
  NOTIFICATION_DETAIL_WATCH,
  NOTIFICATION_DETAIL_RESULTS
}

Used to specify how much detail to send in the notification. It can have the following values:

  • NOTIFICATION_DETAIL_NONE - No detail is sent about the documents or the watch.

  • NOTIFICATION_DETAIL_WATCH - Data identifying the watch sending the notification will be included.

  • NOTIFICATION_DETAIL_RESULTS - In addition to the above, the notification will include data about the matched documents.

struct WatchNotificationConfiguration

struct WatchNotificationConfiguration
{
  string Name;

  boolean IgnoreWatchFreshness;
  long MaxNotificationResults;
  WatchNotificationEndpointSeq Endpoints;
  WatchNotificationTiming Timing;
  QueryInfo Info;
  NotificationDetail Detail;

  string TitleString;
  string HeaderString;
  string ResultString;
  string TailString;
}

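Used to define the parameters of a notification sent by a watch. Its notification strings (TitleString, HeaderString, ResultString and TailString) may contain the special character sequences listed in Table 3.2.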

Table 3.2. Special character sequences to use in notification strings

String | Description | Can be used
%% | Expands to a single percent character (%). | In any notification string
%{wid} | Expands to the numeric id of the watch that is sending the notification. | In any notification string
%{wname} | Expands to the name of the watch that is sending the notification. | In any notification string
%{wurl} | Expands to the URL of the watch results page. | In any notification string
%{rtitle} | Expands to the document title. | Only in ResultString
%{rid} | Expands to the document ID. | Only in ResultString
%{rctxs} | Expands to the watch contexts for the document. | Only in ResultString
%{rurl} | Expands to the original document url. | Only in ResultString
%{rcache} | Expands to the url of the document in searchbox cache. | Only in ResultString
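
To preview a notification string on the client side, the sequences in Table 3.2 can be expanded with a small routine such as the one below. This is only an approximation of what searchbox does internally: the function is hypothetical, and leaving unknown names untouched is an assumption.

```python
import re

# Matches either "%%" or "%{name}".
_SEQUENCE = re.compile(r"%(%|\{(\w+)\})")


def expand(template, values):
    """Expand %% and %{name} sequences using the `values` mapping."""
    def substitute(match):
        if match.group(1) == "%":
            return "%"
        name = match.group(2)
        # Assumption: a sequence with an unknown name is left as it is.
        return str(values.get(name, match.group(0)))
    return _SEQUENCE.sub(substitute, template)


preview = expand(
    "%{wname}: 100%% match for %{rtitle}",
    {"wname": "news", "rtitle": "Hello"},
)
print(preview)
```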

struct WatchConfiguration

struct WatchConfiguration
{
  string Name;
  string Description;
  string Query;
  QueryView View;
  QueryParser Parser;
  integer Freshness;
  integer Collection;
  QuerySliceWeight[] Weights;
  WatchNotificationConfiguration[] WatchNotifications;
  AccessSpec[] ACL;
}

Used to define parameters of a watch, it has the following fields:

  • string Name - Watch name, used by client.

  • string Description - Watch description, used by client.

  • string Query - Watch filter query.

  • QueryView View - The view to query.

  • QueryParser Parser - Watch filter query parser.

  • integer Freshness - Watch results minimum freshness.

  • integer Collection - ID of the collection monitored by this watch.

  • QuerySliceWeight[] Weights - The weights to use for the query.

  • WatchNotificationConfiguration[] WatchNotifications - The notifications sent by this watch.

  • AccessSpec[] ACL - Access control list for this configuration.

struct MetadataTemplateConfiguration

struct MetadataTemplateConfiguration
{
  string Name;
  string Description;
  MetadataValue[] FixedMetadata;
  MetadataKey[] VariableMetadata;
  AccessSpec[] ACL;
}

Used to define a metadata template, it has the following fields:

  • string Name - Template name, used by the client.

  • string Description - Template description, used by the client.

  • MetadataValue[] FixedMetadata - Fixed-value metadata, when the template is applied to a document these metadata will be added as-is.

  • MetadataKey[] VariableMetadata - Variable-value metadata, the value for the metadata may be specified when the template is applied to a document.

  • AccessSpec[] ACL - Access control list for this configuration.

enum QuerySort

enum QuerySort
{
  SORT_STANDARD,
  SORT_RELEVANCE,
  SORT_SCORE,
  SORT_TIME_NEWER,
  SORT_TIME_OLDER
}

Used to specify the sorting of documents returned as result of a query:

  • SORT_STANDARD - The standard sorting is used.

  • SORT_RELEVANCE - The documents are ordered by relevance score.

  • SORT_SCORE - The documents are ordered by their intrinsic score.

  • SORT_TIME_NEWER - The documents are ordered by change timestamp, more recently changed documents first.

  • SORT_TIME_OLDER - The documents are reverse-ordered by change timestamp, least recently changed documents first.

enum QueryAtomType

enum QueryAtomType
{
  ATOM_WORD,
  ATOM_WILDCARD_WORD,
  ATOM_NOT,
  ATOM_AND,
  ATOM_OR,
  ATOM_NEAR,
  ATOM_META,
  ATOM_META_RANGE,
  ATOM_WILDCARD_META
}

Used to specify the type of each QueryAtom (see below). It can have the following values:

  • ATOM_WORD - QueryAtom is a keyword to find.

  • ATOM_WILDCARD_WORD - QueryAtom is a keyword with wildcards. See the query syntax in the User Manual for details on wildcards.

  • ATOM_NOT - QueryAtom is a logic NOT.

  • ATOM_AND - QueryAtom is a logic AND between other QueryAtoms.

  • ATOM_OR - QueryAtom is a logic OR between other QueryAtoms.

  • ATOM_NEAR - QueryAtom is a logic NEAR between words.

  • ATOM_META - QueryAtom is a specific metadata keyword and value to find.

  • ATOM_META_RANGE - QueryAtom is a metadata keyword to find, with the allowed metadata values restricted to a specified range.

  • ATOM_WILDCARD_META - QueryAtom is a metadata keyword to find, with the allowed metadata values restricted to those matching the value, which may contain wildcards.

struct QueryAtom

struct QueryAtom
{
  QueryAtomType Type;
  string Meta;
  string Param;
  string Param1;
}

Used to build a query in RPN notation, it has the following fields:

  • QueryAtomType Type - Current QueryAtom type.

  • string Meta - Contains the meta-keyword type. Only for META, META_RANGE and WILDCARD_META QueryAtoms.

  • string Param - If the current type is WORD, WILDCARD_WORD, META or WILDCARD_META, this field contains the keyword (or metadata value) to find. Param may contain wildcards if the Type allows them. If the current type is AND, OR or NEAR, it contains the decimal string representation of the number of QueryAtoms involved in the expression. For the NOT type, only the value "1" is allowed in this field. Finally, if the atom type is META_RANGE, this is the lower bound of the allowed (inclusive) range of metadata values.

  • string Param1 - If the atom type is META_RANGE, this is the upper bound of the allowed (inclusive) range of metadata values. If the atom type is NEAR, this is the decimal string representation of a non-negative integer specifying the sloppiness of the NEAR operation. The greater the sloppiness, the farther NEAR looks in the documents for its arguments. A sloppiness of zero forces NEAR to look for adjacent words only. The field is unused for all other atom types.
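
As an illustration of the RPN encoding described above, the following Python sketch builds the atom list for (cat AND dog) OR (NOT bird) and checks that it reduces to a single stack element. The plain-dict representation and the helper names atom and rpn_stack_depth are assumptions made for this example and are not part of the searchbox API. The sketch also assumes that "the number of QueryAtoms involved" means the number of operands an operator consumes.

```python
# Hypothetical plain-dict stand-in for the QueryAtom struct.
def atom(type_, param="", meta="", param1=""):
    return {"Type": type_, "Meta": meta, "Param": param, "Param1": param1}

# Atoms that push one value on the RPN stack.
LEAVES = {"ATOM_WORD", "ATOM_WILDCARD_WORD", "ATOM_META",
          "ATOM_META_RANGE", "ATOM_WILDCARD_META"}
# Atoms that consume operands and push one result.
OPERATORS = {"ATOM_NOT", "ATOM_AND", "ATOM_OR", "ATOM_NEAR"}

def rpn_stack_depth(atoms):
    """Simulate RPN evaluation and return the final stack depth."""
    depth = 0
    for a in atoms:
        if a["Type"] in LEAVES:
            depth += 1
        elif a["Type"] in OPERATORS:
            arity = int(a["Param"])  # operand count, as a decimal string
            if a["Type"] == "ATOM_NOT" and arity != 1:
                raise ValueError("ATOM_NOT only accepts Param '1'")
            if arity < 1 or arity > depth:
                raise ValueError("operator needs more operands than available")
            depth -= arity - 1       # pop `arity` operands, push one result
        else:
            raise ValueError("unknown atom type: %r" % a["Type"])
    return depth

# (cat AND dog) OR (NOT bird)
query = [
    atom("ATOM_WORD", "cat"),
    atom("ATOM_WORD", "dog"),
    atom("ATOM_AND", "2"),
    atom("ATOM_WORD", "bird"),
    atom("ATOM_NOT", "1"),
    atom("ATOM_OR", "2"),
]
print(rpn_stack_depth(query))  # prints 1: a well-formed query
```

Running such a check on the client side before submitting the query is optional; it only catches malformed atom sequences early.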

struct QuerySpec

struct QuerySpec
{
  integer[] Archives;
  integer Collection;
  integer Watch;
  integer FirstDoc;
  integer LastDoc;
  integer MinTime;
  integer MaxTime;
  integer MinScore;
  QueryInfo Info;
  QueryView View;
  QuerySort Sort;
  QueryParser Parser;
  QueryAtom[] Query;
  string QueryString;
  QuerySliceWeight[] Weights;
}

Used to submit a query, it has the following fields:

  • integer[] Archives - IDs of the archives you want to query. Leave empty if you want to query a collection or a watch.

  • integer Collection - ID of the collection you want to query. If you want to query archives or a watch, use 0.

  • integer Watch - ID of the watch you want to query. If you want to query archives or a collection, use 0.

  • integer FirstDoc - Index of the first document (starting from 0) returned. It must be less than LastDoc.

  • integer LastDoc - Index of the last document (starting from 0). It must be greater than FirstDoc.

  • integer MinTime - Oldest timestamp (expressed in number of seconds since January 1st 1970 GMT) of query results. All older documents will be rejected.

  • integer MaxTime - Newest timestamp (expressed in number of seconds since January 1st 1970 GMT) of query results. All newer documents will be rejected.

  • integer MinScore - Minimum score of query results. All documents with lower score will be rejected.

  • QueryInfo Info - Detail level of query results.

  • QueryView View - Document set restrictions of query.

  • QuerySort Sort - Result document set sorting type.

  • QueryParser Parser - Parser to use to parse the query string.

  • QueryAtom[] Query - Query in RPN notation, as a list of QueryAtoms. The QueryAtom sequence must produce a stack with only one element. Only used if NOPARSER is specified as QueryParser.

  • string QueryString - Query string to be parsed. Only used if RPNPARSER, ALGPARSER or NETPARSER is specified as QueryParser.

  • QuerySliceWeight[] Weights - Slice weights. You can pass an empty vector to use the default slice weights. To disable a slice in the current query, you must pass an entry for the slice with a weight of 0. If you don't pass an entry for a certain slice, that slice will have its default weight.
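
The field constraints listed above can be checked on the client before a query is submitted. The following Python sketch does so for a QuerySpec held in a plain dict. The dict layout and the function name check_query_spec are assumptions for illustration, and so is the reading that exactly one of Archives, Collection and Watch should be set. Fields that carry no stated constraint (Info, View, Sort, MinTime and so on) are omitted for brevity.

```python
# Parsers that read QueryString, as named in the field descriptions.
STRING_PARSERS = {"RPNPARSER", "ALGPARSER", "NETPARSER"}

def check_query_spec(spec):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []

    # Assumption: a query targets archives, a collection, or a watch.
    targets = [bool(spec.get("Archives")),
               spec.get("Collection", 0) != 0,
               spec.get("Watch", 0) != 0]
    if sum(targets) != 1:
        problems.append("set exactly one of Archives, Collection, Watch")

    if not spec["FirstDoc"] < spec["LastDoc"]:
        problems.append("FirstDoc must be less than LastDoc")

    parser = spec["Parser"]
    if parser == "NOPARSER":
        if not spec.get("Query"):
            problems.append("NOPARSER needs a non-empty Query atom list")
    elif parser in STRING_PARSERS:
        if not spec.get("QueryString"):
            problems.append("%s needs a QueryString" % parser)
    else:
        problems.append("unknown parser: %r" % parser)
    return problems

# Archive ID 3 and the query text are made-up example values.
good = {"Archives": [3], "Collection": 0, "Watch": 0,
        "FirstDoc": 0, "LastDoc": 9,
        "Parser": "ALGPARSER", "QueryString": "cat dog"}
bad = dict(good, Collection=7, FirstDoc=10)

print(check_query_spec(good))
print(check_query_spec(bad))
```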

struct QueryResult

struct QueryResult
{
  string ID;
  string Url;
  string Title;
  string MimeType;
  string[] Contexts;
  integer Timestamp;
  integer Score;
  integer[] Archives;
  MetadataValue[] Metadata;
  long[] Templates;
  long CoreTextID;
}

Used to return information of a query result, it has the following fields:

  • string ID - ID of the document, guaranteed to be unique across archives.

  • string Url - URL of the document.

  • string Title - Title of the document.

  • string MimeType - Mime type of the document.

  • string[] Contexts - Contexts of the document where the keywords have been found.

  • integer Timestamp - Document timestamp (expressed in number of milliseconds since January 1st 1970 GMT).

  • integer Score - Document score (expressed as percentage * 10000).

  • integer[] Archives - The archives the document belongs to.

  • MetadataValue[] Metadata - The document metadata. Filled only if the query has detail INFO_TEMPLATE_METADATA or better. Note that while INFO_ALL_METADATA returns all document metadata, INFO_TEMPLATE_METADATA returns only the metadata added by templates.

  • long[] Templates - The numeric IDs of the templates applied to the document.

  • long CoreTextID - Documents with the same core text have the same CoreTextID.
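
Score and Timestamp are returned as scaled integers, so a client usually converts them before display. A minimal Python sketch follows, assuming Score divided by 10000 yields the percentage and treating GMT as UTC. The helper names are hypothetical.

```python
from datetime import datetime, timezone

def score_percent(score):
    """Convert a QueryResult Score (percentage * 10000) to a percentage."""
    return score / 10000.0

def timestamp_utc(ms):
    """Convert a QueryResult Timestamp (milliseconds since 1970) to UTC."""
    return datetime.fromtimestamp(ms / 1000.0, tz=timezone.utc)

print(score_percent(875000))                       # prints 87.5
print(timestamp_utc(1000000000000).isoformat())    # prints 2001-09-09T01:46:40+00:00
```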

struct SourceInfo

struct SourceInfo
{
  integer Source;
  SourceConfiguration Configuration;
  integer Magic;
  boolean ReadOnly;
}

Used to return basic information about a source, it has the following fields:

  • integer Source - ID of the source.

  • SourceConfiguration Configuration - Configuration of the source.

  • integer Magic - Magic number to use in SetSourceConfiguration.

  • boolean ReadOnly - True if the user cannot change the configuration.

struct ArchiveInfo

struct ArchiveInfo
{
  integer Archive;
  ArchiveConfiguration Configuration;
  integer Magic;
  boolean ReadOnly;
  boolean Crawling;
  integer Documents;
  integer Errors;
  integer LastRunStart;
  integer LastRunEnd;
  integer NextRun;
}

Used to return basic information about an archive, it has the following fields:

  • integer Archive - ID of the archive.

  • ArchiveConfiguration Configuration - Configuration of the archive.

  • integer Magic - Magic number to use in SetArchiveConfiguration.

  • boolean ReadOnly - True if the user cannot change the configuration.

  • boolean Crawling - True if crawling is in progress.

  • integer Documents - Number of documents crawled in the current crawl (if crawling is in progress) or in the last completed crawl.

  • integer Errors - Number of crawl errors in the current crawl (if crawling is in progress) or in the last completed crawl.

  • integer LastRunStart - Begin timestamp (expressed in number of milliseconds since January 1st 1970 GMT) of the last crawl, or 0 if no crawl has been done yet.

  • integer LastRunEnd - End timestamp (expressed in number of milliseconds since January 1st 1970 GMT) of the last crawl, or 0 if no crawl has been done yet.

  • integer NextRun - Timestamp (expressed in number of milliseconds since January 1st 1970 GMT) of the next scheduled crawl, or 0 if no crawl is scheduled.
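
The three timestamp fields use 0 as a "no value" marker, which is easy to mistake for the 1970 epoch. The following Python sketch shows one way to handle the marker and to derive the duration of the last crawl from an ArchiveInfo held in a plain dict. The helper names are hypothetical, and returning no duration while Crawling is true is a cautious assumption, since the text does not say what LastRunEnd holds during a crawl.

```python
from datetime import datetime, timezone

def ms_to_utc(ms):
    """Return None for the 0 marker, otherwise an aware UTC datetime."""
    if ms == 0:
        return None
    return datetime.fromtimestamp(ms / 1000.0, tz=timezone.utc)

def last_crawl_seconds(info):
    """Duration of the last completed crawl in seconds, or None."""
    if info["Crawling"]:
        return None  # assumption: do not trust LastRunEnd mid-crawl
    if info["LastRunStart"] == 0 or info["LastRunEnd"] == 0:
        return None  # no crawl has been done yet
    return (info["LastRunEnd"] - info["LastRunStart"]) / 1000.0

# Made-up example values.
info = {"Crawling": False,
        "LastRunStart": 1000000000000,
        "LastRunEnd": 1000000090000,
        "NextRun": 0}

print(last_crawl_seconds(info))     # prints 90.0
print(ms_to_utc(info["NextRun"]))   # prints None: no crawl is scheduled
```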

struct CollectionInfo

struct CollectionInfo
{
  integer Collection;
  CollectionConfiguration Configuration;
  integer Magic;
  boolean ReadOnly;
}

Used to return basic information about a collection, it has the following fields:

  • integer Collection - ID of the collection.

  • CollectionConfiguration Configuration - Configuration of the collection.

  • integer Magic - Magic number to use in SetCollectionConfiguration.

  • boolean ReadOnly - True if the user cannot change the configuration.

struct WatchInfo

struct WatchInfo
{
  integer Watch;
  WatchConfiguration Configuration;
  integer Magic;
  boolean ReadOnly;
}

Used to return basic information about a watch, it has the following fields:

  • integer Watch - ID of the watch.

  • WatchConfiguration Configuration - Configuration of the watch.

  • integer Magic - Magic number to use in SetWatchConfiguration.

  • boolean ReadOnly - True if the user cannot change the configuration.

struct MetadataTemplateInfo

struct MetadataTemplateInfo
{
  integer MetadataTemplate;
  MetadataTemplateConfiguration Configuration;
  integer Magic;
  boolean ReadOnly;
}

Used to return basic information about a metadata template, it has the following fields:

  • integer MetadataTemplate - ID of the metadata template.

  • MetadataTemplateConfiguration Configuration - Configuration of the metadata template.

  • integer Magic - Magic number to use in SetMetadataTemplateConfiguration.

  • boolean ReadOnly - True if the user cannot change the configuration.

enum PageFormat

enum PageFormat
{
  ORIGINAL,
  PARSED,
  HTML
}

Used to specify the document format to get from cache, it can have the following values:

  • ORIGINAL - The original document is returned.

  • PARSED - The parsed document, in XML format, is returned.

  • HTML - The parsed document, in HTML format, is returned.

enum Status

enum Status
{
  RUNNING,
  CRAWLSTOPPED,
  LOWDISKSPACE
}

Used to encode the status of the platform, it can have the following values:

  • RUNNING - The platform is running normally.

  • CRAWLSTOPPED - The platform is running, but due to the global crawl control the crawl of new documents is halted.

  • LOWDISKSPACE - The platform is running, but due to a low disk space condition the crawl of new documents is halted.

struct ExtendedStatus

struct ExtendedStatus
{
  string Key;
  string Value;
}

Used to pass the extended status of the platform, it has the following fields:

  • string Key - The platform parameter name.

  • string Value - The platform parameter value.

struct CrawlLog

struct CrawlLog
{
  integer Time;
  string Cookie;
}

Used to return basic information about a crawl log, it has the following fields:

  • integer Time - Start time of the crawl.

  • string Cookie - Cookie used to retrieve the log. Takes the "-1" value when the cookie is no longer valid (i.e. the log has been fully read, or the log no longer exists). You must not change this value.
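
The "-1" cookie value suggests a simple read loop on the client side. The Python sketch below is an illustration only: read_chunk is a hypothetical callable standing in for whichever method retrieves log entries, and its shape (cookie in, entries and next cookie out) is an assumption, not the documented signature. The client passes the cookie back unchanged and only compares it with "-1".

```python
def drain_crawl_log(log, read_chunk):
    """Collect all entries of a CrawlLog until the cookie becomes "-1"."""
    entries = []
    cookie = log["Cookie"]
    while cookie != "-1":
        chunk, cookie = read_chunk(cookie)  # cookie is passed back untouched
        entries.extend(chunk)
    return entries

def make_fake_reader(pages):
    """Fake server side for the demo; a real client never parses cookies."""
    def read_chunk(cookie):
        index = int(cookie)
        next_cookie = str(index + 1) if index + 1 < len(pages) else "-1"
        return pages[index], next_cookie
    return read_chunk

# Made-up entries split over two fake pages.
pages = [["entry-a", "entry-b"], ["entry-c"]]
log = {"Time": 0, "Cookie": "0"}
print(drain_crawl_log(log, make_fake_reader(pages)))
```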

enum CrawlError

enum CrawlError
{
  NONE,
  REDIRECT,
  INDEX,
  NOTFOUND,
  UNPARSABLE,
  BADREDIR,
  BLOCKED,
  NETWORK,
  UNKNOWN,
  UNCHANGED,
  OUTOFDOCS,
  AUTHREQ,
  INTERNAL,
  SERVER,
  PLUGIN_ERROR,
  UNFETCHABLE,
  WORKER_FAILURE,
  WORKER_TIMEOUT
}

Used to encode the error status of a crawl log entry, it can have the following values:

  • NONE - No error, the document fetched OK.

  • REDIRECT - A redirect was found.

  • INDEX - An index was found.

  • NOTFOUND - The document was not found.

  • UNPARSABLE - The document was not parsable by the rendering engine.

  • BADREDIR - A bad redirect was found.

  • BLOCKED - The document was blocked by robots.txt exclusion rules.

  • NETWORK - A network error occurred.

  • UNKNOWN - An unknown error occurred.

  • UNCHANGED - The document did not change between consecutive fetches.

  • OUTOFDOCS - The document limit imposed by the activation key was reached.

  • AUTHREQ - The document was not accessible; authentication information must be provided to the server.

  • INTERNAL - An internal searchbox error occurred.

  • SERVER - Generic error reported from the remote server.

  • PLUGIN_ERROR - A plugin reported an error.

  • UNFETCHABLE - The protocol required to fetch the document is unknown to searchbox and no plugin supports it.

  • WORKER_FAILURE - An internal error occurred in the searchbox document processor.

  • WORKER_TIMEOUT - searchbox exceeded the time allotted to fetch and process the document. This can be due to a very large document, a slow network connection or an internal searchbox error.

struct CrawlLogData

struct CrawlLogData
{
  integer Time;
  string ID;
  string Url;
  CrawlError Error;
  string Description;
}

Used to return information about a crawl log entry, it has the following fields:

  • integer Time - Fetch time of the entry.

  • string ID - ID of the entry.

  • string Url - URL of the entry.

  • CrawlError Error - Error status of the entry.

  • string Description - Detailed description for the error.
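
A common use of these entries is a per-error tally after a crawl. The following Python sketch counts CrawlLogData entries by their Error value, skipping NONE. The plain-dict entries, the example URLs and the function name summarize_errors are assumptions for illustration.

```python
from collections import Counter

def summarize_errors(entries):
    """Count crawl log entries per CrawlError value, ignoring NONE."""
    return Counter(e["Error"] for e in entries if e["Error"] != "NONE")

# Made-up entries; only the fields used by the tally are filled in.
entries = [
    {"Url": "http://example.com/a", "Error": "NONE"},
    {"Url": "http://example.com/b", "Error": "NOTFOUND"},
    {"Url": "http://example.com/c", "Error": "NETWORK"},
    {"Url": "http://example.com/d", "Error": "NOTFOUND"},
]

for error, count in summarize_errors(entries).most_common():
    print(error, count)
```

Note that values such as UNCHANGED or REDIRECT are counted too, although they do not necessarily indicate a failure; filter them out if only real failures matter.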

Methods

AddSource

void AddSource(cfg,  
 info); 
in SourceConfiguration  cfg;
out SourceInfo  info;

Adds a new source described by the cfg configuration, and returns the SourceInfo structure in the info parameter.

RemoveSource

void RemoveSource(id);
in integer id;

Removes the source with the specified ID.

GetSourceConfiguration

void GetSourceConfiguration(id,  
 cfg,  
 magic); 
in integer  id;
out SourceConfiguration  cfg;
out integer  magic;

Returns the source configuration identified by id. The integer magic must be used when you want to modify this configuration with a call to SetSourceConfiguration (see below) in order to avoid concurrent changes.

SetSourceConfiguration

void SetSourceConfiguration(id,  
 cfg,  
 magic,  
 info); 
in integer  id;
in SourceConfiguration  cfg;
in integer  magic;
out SourceInfo  info;

Replaces the current configuration of the source identified by id with the configuration cfg. The integer magic must be obtained by a call to GetSourceConfiguration for the same source. Returns the SourceInfo structure in the info parameter.
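
The Get/Set pair with a magic number amounts to an optimistic read-modify-write cycle. The Python sketch below illustrates the client side of that cycle against an in-memory stand-in. FakeSourceService, StaleMagicError, update_source and the example_field key are all hypothetical. How the real server reports a stale magic number, and whether the number changes after every update, are not stated here, so both behaviors are assumptions.

```python
import copy

class StaleMagicError(Exception):
    """Raised by the fake service when the magic number is out of date."""

class FakeSourceService:
    """In-memory stand-in for the SOAP endpoint (illustration only)."""

    def __init__(self):
        self._sources = {}  # source id -> (configuration, magic)
        self._next_id = 1

    def _info(self, source_id):
        cfg, magic = self._sources[source_id]
        return {"Source": source_id, "Configuration": copy.deepcopy(cfg),
                "Magic": magic, "ReadOnly": False}

    def AddSource(self, cfg):
        source_id = self._next_id
        self._next_id += 1
        self._sources[source_id] = (copy.deepcopy(cfg), 1)
        return self._info(source_id)

    def GetSourceConfiguration(self, source_id):
        cfg, magic = self._sources[source_id]
        return copy.deepcopy(cfg), magic

    def SetSourceConfiguration(self, source_id, cfg, magic):
        _, current = self._sources[source_id]
        if magic != current:  # assumption: a stale magic is rejected
            raise StaleMagicError("configuration changed since it was read")
        self._sources[source_id] = (copy.deepcopy(cfg), current + 1)
        return self._info(source_id)

def update_source(service, source_id, change, attempts=3):
    """Read the configuration, apply `change`, write it back; retry if stale."""
    for _ in range(attempts):
        cfg, magic = service.GetSourceConfiguration(source_id)
        change(cfg)
        try:
            return service.SetSourceConfiguration(source_id, cfg, magic)
        except StaleMagicError:
            continue  # someone else changed it: re-read and try again
    raise StaleMagicError("gave up after %d attempts" % attempts)

service = FakeSourceService()
info = service.AddSource({"example_field": "old"})
updated = update_source(service, info["Source"],
                        lambda cfg: cfg.update(example_field="new"))
print(updated["Configuration"])
```

The same pattern would apply to the other Get/Set configuration pairs, since they use a magic number in the same way.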

EnumSourceConfigurations

void EnumSourceConfigurations(infos); 
out SourceInfo[]  infos;

Returns information about all the configured sources.

AddAuth

void AddAuth(cfg,  
 id); 
in AuthConfiguration  cfg;
out integer  id;

Adds a new authentication described by the cfg configuration, and returns the ID of the new authentication in the id parameter.

RemoveAuth

void RemoveAuth(id);
in integer id;

Removes the authentication with the specified ID.

GetAuthConfiguration

void GetAuthConfiguration(id,  
 cfg,  
 magic); 
in integer  id;
out AuthConfiguration  cfg;
out integer  magic;

Returns the authentication configuration identified by id. The integer magic must be used when you want to modify this configuration with a call to SetAuthConfiguration (see below) in order to avoid concurrent changes.

SetAuthConfiguration

void SetAuthConfiguration(id,  
 cfg,  
 magic); 
in integer  id;
in AuthConfiguration  cfg;
in integer  magic;

Replaces the current configuration of the authentication identified by id with the configuration cfg. The integer magic must be obtained by a call to GetAuthConfiguration for the same authentication.

AddArchive

void AddArchive(cfg,  
 info); 
in ArchiveConfiguration  cfg;
out ArchiveInfo  info;

Adds a new archive described by the cfg configuration, and returns the ArchiveInfo structure in the info parameter.

RemoveArchive

void RemoveArchive(id);
in integer id;

Removes the archive with the specified ID.

GetArchiveConfiguration

void GetArchiveConfiguration(id,  
 cfg,  
 magic); 
in integer  id;
out ArchiveConfiguration  cfg;
out integer  magic;

Returns the archive configuration identified by id. The integer magic must be used when you want to modify this configuration with a call to SetArchiveConfiguration (see below) in order to avoid concurrent changes.

SetArchiveConfiguration

void SetArchiveConfiguration(id,  
 cfg,  
 magic,  
 info); 
in integer  id;
in ArchiveConfiguration  cfg;
in integer  magic;
out ArchiveInfo  info;

Replaces the current configuration of the archive identified by id with the configuration cfg. The integer magic must be obtained by a call to GetArchiveConfiguration for the same archive. Returns the ArchiveInfo structure in the info parameter.

EnumArchiveConfigurations

void EnumArchiveConfigurations(infos); 
out ArchiveInfo[]  infos;

Returns information about all the configured archives.

GetArchiveInfo

void GetArchiveInfo(id,  
 info); 
in integer  id;
out ArchiveInfo  info;

Returns information about a configured archive.

AddCollection

void AddCollection(cfg,  
 info); 
in CollectionConfiguration  cfg;
out CollectionInfo  info;

Adds a new collection described by the cfg configuration, and returns the CollectionInfo structure in the info parameter.

RemoveCollection

void RemoveCollection(id);
in integer id;

Removes the collection with the specified ID.

GetCollectionConfiguration

void GetCollectionConfiguration(