This chapter describes how searchbox can be integrated into other applications. The searchbox Engine exposes its functionality as a standard Web Service, so that it can be used through standard SOAP remote calls.
In this section you can find a complete description of searchbox's SOAP interface, defined by a standard WSDL.
The searchbox WSDL definition and SOAP interface are fully interoperable with the Microsoft .NET platform, Apache Software Foundation's Axis (Java) and numerous other SOAP implementations.
Complex data types defined by the interface and all methods will be described in detail.
Assuming you are running searchbox on the local computer on the default port, you can access the WSDL through a URL of the form:
http://user:password@localhost:2200/wsdl
and the SOAP endpoint (included in the WSDL) will have the form:
http://user:password@localhost:2200/soap
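As a minimal sketch, the two URLs above can be assembled from their parts as follows. The helper name `searchbox_urls` and its parameters are illustrative assumptions and not part of the searchbox interface; only the URL shapes and the port 2200 come from the examples above.

```python
from urllib.parse import quote


def searchbox_urls(user, password, host="localhost", port=2200):
    """Return (wsdl_url, soap_url) in the form shown above.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Percent-encode the credentials so that characters such as '@' or ':'
    # cannot be confused with the URL delimiters.
    auth = f"{quote(user, safe='')}:{quote(password, safe='')}"
    base = f"http://{auth}@{host}:{port}"
    return f"{base}/wsdl", f"{base}/soap"


wsdl_url, soap_url = searchbox_urls("user", "password")
print(wsdl_url)
print(soap_url)
```

Most SOAP toolkits only need the WSDL URL, since the endpoint is included in the WSDL itself.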
The methods of the searchbox SOAP interface use the following complex types:
enum AccessType
{
READACCESS,
WRITEACCESS
}
Used to specify the ACL entry access type. It can have the following values:
READACCESS - Read access.
WRITEACCESS - Write access.
struct AccessSpec
{
string Name;
AccessType Access;
boolean Deny;
}
Used to specify an ACL entry. It has the following fields:
string Name - The name of the user or group this entry refers to.
AccessType Access - The type of access you want to grant or deny.
boolean Deny - TRUE if you want to deny the specified access, FALSE to grant it.
enum RootACLType
{
ROOTACL_BROWSING,
ROOTACL_CRAWLING,
ROOTACL_PROCESSING
}
Used to specify the root ACL type. It can have the following values:
ROOTACL_BROWSING - Specifies the Browsing root ACL. This ACL controls the rights to enumerate users and groups (READACCESS), and create new collections and watches (WRITEACCESS).
ROOTACL_CRAWLING - Specifies the Crawling root ACL. This ACL controls the rights to create new sources and archives (WRITEACCESS).
ROOTACL_PROCESSING - Specifies the Processing root ACL. This ACL controls the rights to create new metadata templates (WRITEACCESS).
enum PluginType
{
UNKNOWN_PLUGINTYPE,
ARCHIVEFFF_PLUGINTYPE,
EXTENDED_PLUGINTYPE
}
Describes the type of a plugin or plugin DLL:
UNKNOWN_PLUGINTYPE - Unknown type; this is usually due to a buggy plugin.
ARCHIVEFFF_PLUGINTYPE - Undocumented, for backwards compatibility only.
EXTENDED_PLUGINTYPE - A focuseek plugin.
enum ExtendedPluginType
{
UNKNOWN_EXTENDED_PLUGINTYPE,
PROTOCOL_EXTENDED_PLUGINTYPE,
PARSER_EXTENDED_PLUGINTYPE,
RENDERING_EXTENDED_PLUGINTYPE
}
Describes the type of an extended plugin:
UNKNOWN_EXTENDED_PLUGINTYPE - Unknown type; this is usually due to a buggy plugin.
PROTOCOL_EXTENDED_PLUGINTYPE - A protocol plugin.
PARSER_EXTENDED_PLUGINTYPE - A parser plugin.
RENDERING_EXTENDED_PLUGINTYPE - A rendering plugin.
struct PluginDLLInfo
{
PluginType Type;
string Name;
string Description;
string Producer;
}
Information about a plugin DLL.
PluginType Type - The type of the plugin DLL.
string Name - The name of the plugin DLL.
string Description - The description of the plugin DLL.
string Producer - The producer of the plugin DLL.
struct ExtendedPluginValue
{
string Key;
string Value;
}
The name and value of an extended plugin parameter.
string Key - The parameter name.
string Value - The parameter value.
struct ExtendedPlugin
{
string Name;
string Description;
ExtendedPluginType ExtendedType;
string DLLName;
integer ParentID;
boolean FullyConfigured;
string[] UnsetConfigurationKeys;
ExtendedPluginValue[] SetConfigurationValues;
AccessSpec[] ACL;
}
The configuration describing an extended plugin.
string Name - The plugin name. It is meant to be human-readable.
string Description - The plugin description. It is meant to be human-readable.
ExtendedPluginType ExtendedType - The plugin extended type. This field is filled by the plugin and is ignored when passed by the user.
string DLLName - The name of the plugin dynamic link library.
integer ParentID - The ID of the parent plugin this plugin inherits its configuration from. If this is 0 then the plugin has no parent. Note that DLLName is ignored when ParentID is not 0.
boolean FullyConfigured - If true the plugin configuration is complete and the plugin can be used. This field is filled by the plugin and is ignored when passed by the user.
string[] UnsetConfigurationKeys - The names of the parameters the plugin requires but that are not set. This field is filled by the plugin and is ignored when passed by the user.
ExtendedPluginValue[] SetConfigurationValues - The set parameters and their values.
AccessSpec[] ACL - Access control list for this plugin.
struct ExtendedPluginInfo
{
integer ExtendedPluginID;
ExtendedPlugin ExtendedPluginConfiguration;
integer Magic;
boolean ReadOnly;
}
Information about an extended plugin.
integer ExtendedPluginID - The plugin ID.
ExtendedPlugin ExtendedPluginConfiguration - The plugin configuration.
integer Magic - Magic number to use in SetExtendedPlugin.
boolean ReadOnly - If true the ACL allows the user read only access to the plugin but not write access.
struct ExtendedPluginsBundle
{
ExtendedPluginType ExtendedType;
integer[] ExtendedPluginIDs;
}
Information about a bundle of extended plugins.
ExtendedPluginType ExtendedType - The type of the plugins in this bundle.
integer[] ExtendedPluginIDs - The IDs of the plugins making up the bundle.
enum AuthType
{
AUTH_NONE,
AUTH_BASIC,
AUTH_COOKIE,
AUTH_SSLCERT
}
Used to specify the authentication type. It can have the following values:
AUTH_NONE - No authentication.
AUTH_BASIC - Use plain username/password authentication.
AUTH_COOKIE - Use cookie authentication (only for HTTP/HTTPS).
AUTH_SSLCERT - Use an SSL certificate for authentication.
enum CookieParamType
{
USERNAME,
PASSWORD,
OTHER
}
Used to specify the cookie request parameter type. It can have the following values:
USERNAME - The parameter is a username.
PASSWORD - The parameter is a password.
OTHER - None of the above.
struct CookieParam
{
string Name;
string Value;
CookieParamType Type;
}
Used to specify a cookie request parameter. It has the following fields:
string Name - Name of the parameter.
string Value - Value of the parameter.
CookieParamType Type - Type of the parameter.
enum CookieActionType
{
GET,
POST
}
Used to specify the cookie request action type. It can have the following values:
GET - The request action will be an HTTP GET.
POST - The request action will be an HTTP POST.
struct CookieAuthParams
{
string Url;
CookieActionType Action;
integer Freshness;
CookieParam[] Params;
string[] PreUrls;
}
Used to specify the cookie authentication parameters. It has the following fields:
string Url - Cookie request URL.
CookieActionType Action - Cookie request action.
integer Freshness - Cookie freshness, in seconds.
CookieParam[] Params - Cookie request parameters.
string[] PreUrls - Cookie request pre URLs. This sequence of URLs will be visited before accessing the cookie request URL, and cookies returned by the server will be accumulated and used (if applicable) in the cookie request.
struct MetadataKey
{
string Key;
integer Slice;
}
Used to specify a metadata key. It has the following fields:
string Key - Metadata key.
integer Slice - Slice where metadata is placed.
struct MetadataValue
{
MetadataKey Key;
string Value;
}
Used to specify a metadata value. It has the following fields:
MetadataKey Key - Metadata key.
string Value - Metadata value.
struct SourceConfiguration
{
string Name;
string Description;
string[] Seeds;
string[] IncludeFilters;
string[] ExcludeFilters;
integer DepthLimit;
boolean CheckRobotsTxt;
boolean HasSideMetadata;
AuthType Type;
CookieAuthParams CookieParams;
integer[] RendererDocFilterIDs;
ExtendedPluginsBundle[] ExtendedPluginsBundles;
string[] TextExclusionBeginMarker;
string[] TextExclusionEndMarker;
AccessSpec[] ACL;
}
Used to define the parameters of a source. It has the following fields:
string Name - Source name, used by client.
string Description - Source description, used by client.
string[] Seeds - List of seed URLs for crawling.
string[] IncludeFilters - Regular expressions that must be matched for a URL to be accepted by this source (including seeds).
string[] ExcludeFilters - Regular expressions that must not be matched for a URL to be accepted by this source (including seeds).
integer DepthLimit - Crawl depth limit. "0" means no depth limit. The seed is counted as 1 level, and every hyperlink followed starting from the seed URLs is counted as one additional level, so this value minus 1 is the maximum number of hyperlinks that can be followed starting from the seeds. All documents further away from the seeds will not be crawled.
boolean CheckRobotsTxt - Set to TRUE to obey robots.txt rules during the crawl; set to FALSE to crawl URLs regardless of exclusions set by the webmaster (only for HTTP/HTTPS). Beware that ignoring robots.txt rules will probably lead to a permanent ban of your IP address by the web site administrator!
boolean HasSideMetadata - Set to TRUE if you want to fetch and parse side-by-side metadata files.
AuthType Type - Type of authentication to use for crawling.
CookieAuthParams CookieParams - Cookie parameters, needed when using Cookie authentication (only for HTTP).
integer[] RendererDocFilterIDs - This field is used for backwards compatibility only.
ExtendedPluginsBundle[] ExtendedPluginsBundles - The extended plugin bundles to use for this source. Note that at most one bundle per extended plugin type is allowed for each source.
string[] TextExclusionBeginMarker - Regular expressions identifying the start of document[1] sections searchbox should completely ignore. Any text or formatting elements between a TextExclusionBeginMarker and the next TextExclusionEndMarker are ignored. The regular expressions are matched against the source of the document itself.
string[] TextExclusionEndMarker - Regular expressions identifying the end of document sections searchbox should ignore. See TextExclusionBeginMarker above for more details.
AccessSpec[] ACL - Access control list for this configuration.
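The DepthLimit and filter rules described above can be sketched as follows. This is an illustration only, not searchbox's implementation: the function names are hypothetical, and the filter semantics (at least one include filter must match anywhere in the URL, and no exclude filter may match) are an assumption, since the field descriptions do not spell them out.

```python
import re


def max_links_from_seed(depth_limit):
    """Return the maximum number of hyperlinks followed, or None if unlimited."""
    if depth_limit < 0:
        # Assumption: negative values are treated as invalid.
        raise ValueError("DepthLimit must not be negative")
    # 0 means "no depth limit"; otherwise the seed itself counts as level 1.
    return None if depth_limit == 0 else depth_limit - 1


def url_accepted(url, include_filters, exclude_filters):
    """Sketch of the include/exclude check (semantics assumed, see lead-in)."""
    # Assumption: a URL passes when at least one include filter matches...
    included = any(re.search(pattern, url) for pattern in include_filters)
    # ...and no exclude filter matches.
    excluded = any(re.search(pattern, url) for pattern in exclude_filters)
    return included and not excluded


print(max_links_from_seed(3))
print(url_accepted("http://example.com/docs/a.html",
                   [r"^http://example\.com/"], [r"\.pdf$"]))
```

With a DepthLimit of 3, documents up to two hyperlinks away from a seed are crawled.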
struct ParamValue
{
string Name;
string Value;
}
Used to specify the value you want to assign to a cookie parameter. It has the following fields:
string Name - The name of the parameter. You can specify only parameters of type OTHER.
string Value - The value of the parameter.
struct AuthConfiguration
{
string Username;
string Password;
ParamValue[] ParamValues;
string SSLCertificate;
AccessSpec[] ACL;
}
Used to specify an authentication configuration. It has the following fields:
string Username - Username (for Basic and Cookie authentication).
string Password - Password (for Basic and Cookie authentication).
ParamValue[] ParamValues - List of cookie parameter values (only for Cookie authentication).
string SSLCertificate - The SSL certificate to use, in Privacy-Enhanced Electronic Mail (PEM) format[2].
AccessSpec[] ACL - Access control list for this configuration.
enum Frequency
{
DISABLED,
HOURLY,
DAILY,
WEEKLY,
MONTHLY
}
Used to set the base refresh frequency of an archive. It can have the following values:
DISABLED - Refresh disabled
HOURLY - Hourly refresh (every N hours)
DAILY - Daily refresh (every N days)
WEEKLY - Weekly refresh (every given day of week)
MONTHLY - Monthly refresh (every given day of month)
enum Cache
{
FULLCACHE,
CONTEXTCACHE,
NOCACHE
}
Used to set the level of document caching. It can have the following values:
FULLCACHE - Retain full document cache
CONTEXTCACHE - Retain minimal cache needed for context extraction
NOCACHE - Don't retain any cache
struct ArchiveConfiguration
{
string Name;
boolean Historicize;
string AdministrativeContact;
integer Source;
integer Auth;
Frequency SchedulingFrequency;
integer SchedulingAttribute;
integer AccessTime;
integer PagesLimit;
integer GarbageLimit;
Cache DocumentCache;
integer Throttling;
AccessSpec[] ACL;
}
Used to define the parameters of an archive. It has the following fields:
string Name - Archive name, used by client.
boolean Historicize - Enable historicization. The value that is initially set with the AddArchive call is retained throughout all the lifecycle of the archive, i.e. you cannot change its value with the SetArchiveConfiguration call.
string AdministrativeContact - Email address of administrative contact.
integer Source - ID of the source configuration used to crawl documents for this archive.
integer Auth - ID of the authentication configuration used to crawl documents for this archive when an authentication type other than AUTH_NONE is specified in the source configuration. Use 0 if AUTH_NONE is used in the source configuration.
Frequency SchedulingFrequency - Base archive automatic refresh frequency.
integer SchedulingAttribute - Base refresh frequency multiplication factor. If the base frequency is HOURLY or DAILY, this is the interval in hours or days between two crawls. If the base frequency is WEEKLY, this is the day of week the crawl is intended to start on (0=Sunday, 1=Monday, etc.). If the base frequency is MONTHLY, this is the day of the month the crawl is intended to start on.
integer AccessTime - Start of crawl. If the base frequency is HOURLY it can have a value between 0 and SchedulingAttribute and sets the time of the first daily crawl, expressed as an offset from 00:00 GMT. If the base frequency is DAILY, WEEKLY or MONTHLY it can have a value between 0 and 23 and sets the GMT time of crawl.
integer PagesLimit - Maximum number of documents to fetch during a crawl session. "0" means no limit. When this limit is reached during a crawl, the crawling session is terminated.
integer GarbageLimit - Minimum age of a document before garbage collection. "0" means no garbage collection. The age is expressed in seconds and is measured from the last time the document was fetched.
Cache DocumentCache - Caching level to be used for this archive.
integer Throttling - The minimum interval (in seconds) searchbox should wait after fetching a document for the archive before attempting to fetch another.
AccessSpec[] ACL - Access control list for this configuration.
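The interplay between SchedulingFrequency, SchedulingAttribute and AccessTime can be easier to follow as a small validity check. The sketch below is not part of searchbox; the function name is hypothetical, and the bounds marked as assumptions (inclusive ranges, minimum intervals, day-of-month limits) are guesses that the field descriptions do not confirm.

```python
def check_schedule(frequency, attribute, access_time):
    """Return a list of problems found in a scheduling triple (empty if none)."""
    problems = []
    if frequency == "DISABLED":
        return problems  # nothing to check when refresh is disabled
    if frequency == "HOURLY":
        if attribute < 1:
            # Assumption: an interval of less than one hour is not meaningful.
            problems.append("interval in hours must be at least 1")
        elif not 0 <= access_time <= attribute:
            # Assumption: the range 0..SchedulingAttribute is inclusive.
            problems.append("AccessTime must be between 0 and SchedulingAttribute")
    elif frequency in ("DAILY", "WEEKLY", "MONTHLY"):
        if not 0 <= access_time <= 23:
            problems.append("AccessTime must be between 0 and 23")
        if frequency == "DAILY" and attribute < 1:
            # Assumption: an interval of less than one day is not meaningful.
            problems.append("interval in days must be at least 1")
        if frequency == "WEEKLY" and not 0 <= attribute <= 6:
            # 0=Sunday, 1=Monday, ...; the upper bound 6 is an assumption.
            problems.append("day of week must be between 0 and 6")
        if frequency == "MONTHLY" and not 1 <= attribute <= 31:
            # Assumption: days of the month are numbered from 1 to 31.
            problems.append("day of month must be between 1 and 31")
    else:
        raise ValueError(f"unknown frequency: {frequency}")
    return problems


# Every 6 hours, first crawl of the day at 03:00 GMT.
print(check_schedule("HOURLY", 6, 3))
# Every Monday at 23:00 GMT.
print(check_schedule("WEEKLY", 1, 23))
```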
struct CollectionConfiguration
{
string Name;
string Description;
integer[] Archives;
AccessSpec[] ACL;
}
Used to define the parameters of a collection. It has the following fields:
string Name - Collection name, used by client.
string Description - Collection description, used by client.
integer[] Archives - IDs of the archives that build up this collection.
AccessSpec[] ACL - Access control list for this configuration.
enum QueryView
{
VIEW_PUBLISHED,
VIEW_CORECHANGED
}
Used to restrict the set of documents returned as the result of a query:
VIEW_PUBLISHED - The query is applied to all the documents currently in the archive.
VIEW_CORECHANGED - The query is applied only to the documents currently in the archive that have changed in the core of the text. Only applicable to a historicizing archive.
enum QueryParser
{
NOPARSER,
RPNPARSER,
ALGPARSER,
NETPARSER
}
Used to specify the query string parser to use:
NOPARSER - Don't parse, the query is submitted using QueryAtoms.
RPNPARSER - Use the RPN parser.
ALGPARSER - Use the ALG parser.
NETPARSER - Use the NET parser.
enum WatchNotificationMedia
{
MAIL_WATCH_NOTIFICATION_MEDIA,
JABBER_WATCH_NOTIFICATION_MEDIA,
SOAP_WATCH_NOTIFICATION_MEDIA
}
The protocol (media) to use when sending a notification:
MAIL_WATCH_NOTIFICATION_MEDIA - The notification will be sent using e-mail (SMTP).
JABBER_WATCH_NOTIFICATION_MEDIA - The notification will be sent using the Jabber instant message protocol.
SOAP_WATCH_NOTIFICATION_MEDIA - The notification will be sent by calling a remote web service with the SOAP protocol. See the section called “The SOAP notification service requirements” for details on the called service.
struct WatchNotificationEndpoint
{
WatchNotificationMedia Media;
string Address;
}
The destination to send a notification to:
WatchNotificationMedia Media - The protocol to use when sending the notification.
string Address - The address to send the notification to. The format of the address is media dependent; see Table 3.1, “Notification address formats”.
Table 3.1. Notification address formats
| Media | Description |
|---|---|
| MAIL_WATCH_NOTIFICATION_MEDIA | An RFC 822 e-mail address, for example support@focuseek.com. |
| JABBER_WATCH_NOTIFICATION_MEDIA | A Jabber ID in the form user@domain/resource, for example somebody@jabber.org/client. /resource is usually omitted. See http://www.jabber.org/ for more details. |
| SOAP_WATCH_NOTIFICATION_MEDIA | The SOAP endpoint to call: an http or https URL, for example http://localhost:8080/MyService. |
| ACTION_WATCH_NOTIFICATION_MEDIA | In this case the address specifies the action searchbox will perform when the notification is triggered. The address is a sequence of tokens separated by a single ASCII space. The first token is always present and specifies the action to perform. |
[a] For a list of valid slice ids see the section called “struct QuerySliceWeight”
struct WatchNotificationTiming
{
Frequency SchedulingFrequency;
integer SchedulingAttribute;
integer AccessTime;
}
Defines how often the notification will be sent. Note that a notification is never sent if there are no documents in it. The structure has the following fields:
Frequency SchedulingFrequency - Base notification frequency.
integer SchedulingAttribute - Base notification frequency multiplication factor. If the base frequency is HOURLY or DAILY, this is the interval in hours or days between two notifications. If the base frequency is WEEKLY, this is the day of week the notification will be sent on (0=Sunday, 1=Monday, etc.). If the base frequency is MONTHLY, this is the day of the month the notification is intended to be sent on.
integer AccessTime - Time of notification. If the base frequency is HOURLY it can have a value between 0 and SchedulingAttribute and sets the time of the first daily notification, expressed as an offset from 00:00 GMT. If the base frequency is DAILY, WEEKLY or MONTHLY it can have a value between 0 and 23 and sets the GMT time of the notification.
enum QueryInfo
{
INFO_NONE,
INFO_URL,
INFO_TITLE,
INFO_CONTEXT,
INFO_TEMPLATE_METADATA,
INFO_ALL_METADATA
}
Used to specify the information returned as the result of a query:
INFO_NONE - For each result no additional info is returned.
INFO_URL - For each result the URL is returned.
INFO_TITLE - For each result the URL and the title are returned.
INFO_CONTEXT - For each result the URL, the title, the mime type and the contexts where the keywords specified into the query have been found are returned.
INFO_TEMPLATE_METADATA - The same as INFO_CONTEXT but also returns the metadata added by templates.
INFO_ALL_METADATA - The same as INFO_TEMPLATE_METADATA but returns all the metadata.
struct QuerySliceWeight
{
integer Slice;
integer Weight;
}
Used to specify slice weights. It has the following fields:
integer Slice - Dict ID. The following dict IDs can be used:
1 - Author
2 - Keyword
3 - Abstract
4 - Invisible
5 - Marginal normal text
6 - Marginal emphasized text
7 - Marginal link text
8 - Marginal remote link text
9 - Marginal header text
10 - Central normal text
11 - Central emphasized text
12 - Central link text
13 - Central remote link text
14 - Central header text
15 - Title
integer Weight - Slice weight. Must be greater than or equal to 0.
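As an illustration, a list of QuerySliceWeight entries can be built from slice names with a small helper. The helper and the dictionary-based representation are assumptions made for this sketch; a real SOAP toolkit would use the types generated from the WSDL. Only the numeric IDs come from the list above.

```python
# Slice IDs as listed above.
SLICE_IDS = {
    "Author": 1,
    "Keyword": 2,
    "Abstract": 3,
    "Invisible": 4,
    "Marginal normal text": 5,
    "Marginal emphasized text": 6,
    "Marginal link text": 7,
    "Marginal remote link text": 8,
    "Marginal header text": 9,
    "Central normal text": 10,
    "Central emphasized text": 11,
    "Central link text": 12,
    "Central remote link text": 13,
    "Central header text": 14,
    "Title": 15,
}


def slice_weights(overrides):
    """Turn {slice name: weight} into a list of QuerySliceWeight-like dicts.

    Hypothetical helper: plain dicts stand in for the generated SOAP types.
    """
    entries = []
    for name, weight in overrides.items():
        if weight < 0:
            raise ValueError("Weight must be greater than or equal to 0")
        entries.append({"Slice": SLICE_IDS[name], "Weight": weight})
    return entries


# Boost titles and give the "Invisible" slice a weight of 0.
print(slice_weights({"Title": 30, "Invisible": 0}))
```

Slices left out of the list are simply not mentioned in the result, so only the entries you want to override need to be passed.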
enum NotificationDetail
{
NOTIFICATION_DETAIL_NONE,
NOTIFICATION_DETAIL_WATCH,
NOTIFICATION_DETAIL_RESULTS
}
Used to specify how much detail to send in the notification. It can have the following values:
NOTIFICATION_DETAIL_NONE - No detail is sent about the documents or the watch.
NOTIFICATION_DETAIL_WATCH - Data identifying the watch sending the notification will be included.
NOTIFICATION_DETAIL_RESULTS - In addition to the above, the notification will include data about the matched documents.
struct WatchNotificationConfiguration
{
string Name;
boolean IgnoreWatchFreshness;
long MaxNotificationResults;
WatchNotificationEndpoint[] Endpoints;
WatchNotificationTiming Timing;
QueryInfo Info;
NotificationDetail Detail;
string TitleString;
string HeaderString;
string ResultString;
string TailString;
}
Used to define the parameters of a watch notification. It has the following fields:
string Name - Notification name; it is sent to the user.
boolean IgnoreWatchFreshness - If true, the notification will include all the documents acquired or changed since the last time the notification was sent. If false, documents older than the watch freshness will be ignored.
long MaxNotificationResults - Maximum number of documents to include in the notification.
WatchNotificationEndpoint[] Endpoints - The endpoints to send the notification to.
WatchNotificationTiming Timing - When and how often the notification is sent.
QueryInfo Info - The info about each document the notification will include.
NotificationDetail Detail - How much detail about the watch and the matched documents the notification will include.
string TitleString - Title to use in the notification. It can contain any UTF-8 character and some special sequences to embed results information; see Table 3.2, “Special character sequences to use in notification strings”.
string HeaderString - The header to use in the notification. It can contain any UTF-8 character and some special sequences to embed results information; see Table 3.2, “Special character sequences to use in notification strings”.
string ResultString - Text for each document in the notification. It can contain any UTF-8 character and some special sequences to embed results information; see Table 3.2, “Special character sequences to use in notification strings”.
string TailString - The tail to use in the notification. It can contain any UTF-8 character and some special sequences to embed results information; see Table 3.2, “Special character sequences to use in notification strings”.
Table 3.2. Special character sequences to use in notification strings
| String | Description | Can be used |
|---|---|---|
| %% | Expands to a single percent character (%). | In any notification string |
| %{wid} | Expands to the numeric ID of the watch that is sending the notification. | In any notification string |
| %{wname} | Expands to the name of the watch that is sending the notification. | In any notification string |
| %{wurl} | Expands to the URL of the watch results page. | In any notification string |
| %{rtitle} | Expands to the document title. | Only in ResultString |
| %{rid} | Expands to the document ID. | Only in ResultString |
| %{rctxs} | Expands to the watch contexts for the document. | Only in ResultString |
| %{rurl} | Expands to the original document URL. | Only in ResultString |
| %{rcache} | Expands to the URL of the document in the searchbox cache. | Only in ResultString |
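To preview what a notification string will look like, the expansion in Table 3.2 can be approximated on the client side. The sketch below is not searchbox's own implementation: the function name is hypothetical, leaving unknown sequences untouched is an assumption, and it does not enforce the "Only in ResultString" restriction.

```python
import re

# Matches either "%%" or a "%{name}" sequence.
_TOKEN = re.compile(r"%(%|\{(\w+)\})")


def expand(template, values):
    """Expand %% and %{name} sequences using the given values."""
    def replace(match):
        if match.group(1) == "%":
            return "%"  # "%%" stands for a single percent character
        name = match.group(2)
        # Assumption: sequences without a known value are left untouched.
        return str(values.get(name, match.group(0)))

    return _TOKEN.sub(replace, template)


print(expand("%{wname}: 100%% of %{rtitle}",
             {"wname": "News", "rtitle": "Hello"}))
```

Because the string is scanned from left to right, `%%{wid}` yields the literal text `%{wid}` rather than the watch ID.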
struct WatchConfiguration
{
string Name;
string Description;
string Query;
QueryView View;
QueryParser Parser;
integer Freshness;
integer Collection;
QuerySliceWeight[] Weights;
WatchNotificationConfiguration[] WatchNotifications;
AccessSpec[] ACL;
}
Used to define the parameters of a watch. It has the following fields:
string Name - Watch name, used by client.
string Description - Watch description, used by client.
string Query - Watch filter query.
QueryView View - The view to query.
QueryParser Parser - Watch filter query parser.
integer Freshness - Watch results minimum freshness.
integer Collection - ID of the collection monitored by this watch.
QuerySliceWeight[] Weights - The weights to use for the query.
WatchNotificationConfiguration[] WatchNotifications - The notifications sent by this watch.
AccessSpec[] ACL - Access control list for this configuration.
struct MetadataTemplateConfiguration
{
string Name;
string Description;
MetadataValue[] FixedMetadata;
MetadataKey[] VariableMetadata;
AccessSpec[] ACL;
}
Used to define a metadata template. It has the following fields:
string Name - Template name, used by the client.
string Description - Template description, used by the client.
MetadataValue[] FixedMetadata - Fixed-value metadata, when the template is applied to a document these metadata will be added as-is.
MetadataKey[] VariableMetadata - Variable-value metadata, the value for the metadata may be specified when the template is applied to a document.
AccessSpec[] ACL - Access control list for this configuration.
enum QuerySort
{
SORT_STANDARD,
SORT_RELEVANCE,
SORT_SCORE,
SORT_TIME_NEWER,
SORT_TIME_OLDER
}
Used to specify the sorting of documents returned as the result of a query:
SORT_STANDARD - The standard sorting is used.
SORT_RELEVANCE - The documents are ordered by relevance score.
SORT_SCORE - The documents are ordered by their intrinsic score.
SORT_TIME_NEWER - The documents are ordered by change timestamp, more recently changed documents first.
SORT_TIME_OLDER - The documents are reverse-ordered by change timestamp, least recently changed documents first.
enum QueryAtomType
{
ATOM_WORD,
ATOM_WILDCARD_WORD,
ATOM_NOT,
ATOM_AND,
ATOM_OR,
ATOM_NEAR,
ATOM_META,
ATOM_META_RANGE,
ATOM_WILDCARD_META
}
Used to specify the type of each QueryAtom (see below). It can have the following values:
ATOM_WORD - QueryAtom is a keyword to find.
ATOM_WILDCARD_WORD - QueryAtom is a keyword with wildcards. See the query syntax in the User Manual for details on wildcards.
ATOM_NOT - QueryAtom is a logic NOT.
ATOM_AND - QueryAtom is a logic AND between other QueryAtoms.
ATOM_OR - QueryAtom is a logic OR between other QueryAtoms.
ATOM_NEAR - QueryAtom is a logic NEAR between words.
ATOM_META - QueryAtom is a specific metadata keyword and value to find.
ATOM_META_RANGE - QueryAtom is a metadata keyword to find; moreover allowed metadata values are restricted to a specified range.
ATOM_WILDCARD_META - QueryAtom is a metadata keyword to find; moreover allowed metadata values are restricted to those matching the value, which may contain wildcards.
struct QueryAtom
{
QueryAtomType Type;
string Meta;
string Param;
string Param1;
}
Used to build a query in RPN notation. It has the following fields:
QueryAtomType Type - Current QueryAtom type.
string Meta - Contains the meta-keyword type. Only for META, META_RANGE and WILDCARD_META QueryAtoms.
string Param - If the current type is WORD, WILDCARD_WORD, META or WILDCARD_META this field contains the keyword (or metadata value) to find. Param might contain wildcards if the Type allows them. Otherwise, if the current type is AND, OR or NEAR it contains the decimal string representation of the number of QueryAtoms involved in the expression. For the NOT type, only the value "1" is allowed in this field. Finally, if the atom type is META_RANGE this is the left extreme of the allowed (inclusive) range of metadata values.
string Param1 - If the atom type is META_RANGE this is the right extreme of the allowed (inclusive) range of metadata values. If the atom type is NEAR then this is the decimal string representation of a greater than zero integer, specifying the sloppiness for the NEAR operation. The greater the sloppiness the farther NEAR looks in the documents for its arguments. A sloppiness of zero forces NEAR to look for adjacent words only. The field is unused for all the other atom types.
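A worked example may help when assembling atoms by hand. The sketch below builds the RPN form of "search AND engine AND NOT author:Smith" and checks that the sequence reduces to a single stack element. The helper names, the dict representation and the metadata key "author" are hypothetical; the stack model (operand atoms push one element, operators consume as many elements as their Param states and push one result) is an assumption based on the usual reading of RPN.

```python
OPERANDS = {"ATOM_WORD", "ATOM_WILDCARD_WORD", "ATOM_META",
            "ATOM_META_RANGE", "ATOM_WILDCARD_META"}
OPERATORS = {"ATOM_NOT", "ATOM_AND", "ATOM_OR", "ATOM_NEAR"}


def atom(type_, param="", meta="", param1=""):
    """Build a QueryAtom-like dict (stand-in for the generated SOAP type)."""
    return {"Type": type_, "Meta": meta, "Param": param, "Param1": param1}


def final_stack_depth(atoms):
    """Simulate the RPN stack and return its depth after the last atom."""
    depth = 0
    for a in atoms:
        if a["Type"] in OPERANDS:
            depth += 1  # operands push one element
        elif a["Type"] in OPERATORS:
            arity = int(a["Param"])  # number of atoms involved, as a string
            if a["Type"] == "ATOM_NOT" and arity != 1:
                raise ValueError('NOT only allows "1" in Param')
            if arity < 1 or arity > depth:
                raise ValueError("operator has too few operands on the stack")
            depth -= arity - 1  # pop `arity` elements, push one result
        else:
            raise ValueError(f"unknown atom type: {a['Type']}")
    return depth


query = [
    atom("ATOM_WORD", "search"),
    atom("ATOM_WORD", "engine"),
    atom("ATOM_AND", "2"),
    atom("ATOM_META", "Smith", meta="author"),  # "author" is a made-up key
    atom("ATOM_NOT", "1"),
    atom("ATOM_AND", "2"),
]
print(final_stack_depth(query))
```

A sequence is well formed under this model when the final depth is exactly 1.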
struct QuerySpec
{
integer[] Archives;
integer Collection;
integer Watch;
integer FirstDoc;
integer LastDoc;
integer MinTime;
integer MaxTime;
integer MinScore;
QueryInfo Info;
QueryView View;
QuerySort Sort;
QueryParser Parser;
QueryAtom[] Query;
string QueryString;
QuerySliceWeight[] Weights;
}
Used to submit a query. It has the following fields:
integer[] Archives - IDs of the archives you want to query. Leave empty if you want to query a collection or a watch.
integer Collection - ID of the collection you want to query. If you want to query archives or a watch, use 0.
integer Watch - ID of the watch you want to query. If you want to query archives or a collection, use 0.
integer FirstDoc - Index of the first document (starting from 0) returned. It must be less than LastDoc.
integer LastDoc - Index of the last document (starting from 0). It must be greater than FirstDoc.
integer MinTime - Oldest Timestamp (expressed in number of seconds since January 1st 1970 GMT) of query results. All older documents will be rejected.
integer MaxTime - Newest Timestamp (expressed in number of seconds since January 1st 1970 GMT) of query results. All newer documents will be rejected.
integer MinScore - Minimum score of query results. All documents with lower score will be rejected.
QueryInfo Info - Detail level of query results.
QueryView View - Document set restrictions of query.
QuerySort Sort - Result document set sorting type.
QueryParser Parser - Parser to use to parse the query string.
QueryAtom[] Query - Query in RPN notation, as a list of QueryAtoms. The QueryAtom sequence must produce a stack with only one element. Only used if NOPARSER is specified as QueryParser.
string QueryString - Query string to be parsed. Only used if RPNPARSER, ALGPARSER or NETPARSER is specified as QueryParser.
QuerySliceWeight[] Weights - Slice weights. You can pass an empty vector to use the default slice weights. To disable a slice in the current query, you must pass an entry for the slice with a weight of 0. If you don't pass an entry for a certain slice, that slice will have its default weight.
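Several of the fields above constrain each other, which a client can check before submitting a query. The following sketch is only an illustration: the function name and the dict representation are hypothetical, and requiring exactly one of Archives, Collection and Watch is an assumption inferred from the field descriptions rather than a documented rule.

```python
def check_query_spec(spec):
    """Return a list of problems found in a QuerySpec-like dict (empty if none)."""
    problems = []
    # Archives is left empty, and Collection/Watch are 0, when not used.
    targets = [
        bool(spec.get("Archives")),
        spec.get("Collection", 0) != 0,
        spec.get("Watch", 0) != 0,
    ]
    if sum(targets) != 1:
        # Assumption: exactly one query target is expected.
        problems.append("set exactly one of Archives, Collection or Watch")
    if not spec.get("FirstDoc", 0) < spec.get("LastDoc", 0):
        problems.append("FirstDoc must be less than LastDoc")
    if spec.get("Parser") == "NOPARSER":
        if not spec.get("Query"):
            problems.append("NOPARSER needs a non-empty Query atom list")
    elif not spec.get("QueryString"):
        problems.append("a parser other than NOPARSER needs a QueryString")
    return problems


spec = {
    "Archives": [],
    "Collection": 12,  # made-up collection ID
    "Watch": 0,
    "FirstDoc": 0,
    "LastDoc": 10,
    "Parser": "ALGPARSER",
    "QueryString": "search engine",
}
print(check_query_spec(spec))
```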
struct QueryResult
{
string ID;
string Url;
string Title;
string MimeType;
string[] Contexts;
integer Timestamp;
integer Score;
integer[] Archives;
MetadataValue[] Metadata;
long[] Templates;
long CoreTextID;
}
Used to return information about a query result. It has the following fields:
string ID - ID of the document, guaranteed to be unique across archives.
string Url - URL of the document.
string Title - Title of the document.
string MimeType - Mime type of the document.
string[] Contexts - Contexts of the document where the keywords have been found.
integer Timestamp - Document timestamp (expressed in number of milliseconds since January 1st 1970 GMT).
integer Score - Document score (expressed as percentage * 10000).
integer[] Archives - The archives the document belongs to.
MetadataValue[] Metadata - The document metadata. Filled only if the query has detail INFO_TEMPLATE_METADATA or better. Note that while INFO_ALL_METADATA returns all document metadata INFO_TEMPLATE_METADATA returns only metadata added by templates.
long[] Templates - The numeric IDs of the templates applied to the document.
long CoreTextID - Documents with the same core text have the same CoreTextID.
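The Timestamp and Score fields use scaled integer encodings, so a client will usually convert them before display. A minimal sketch, assuming the encodings are exactly as described above (milliseconds since January 1st 1970 GMT, and percentage * 10000); the function names are hypothetical.

```python
from datetime import datetime, timezone


def score_to_percentage(score):
    """Undo the "percentage * 10000" scaling of QueryResult.Score."""
    return score / 10000


def timestamp_to_datetime(timestamp_ms):
    """Convert a millisecond timestamp into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


print(score_to_percentage(505000))
print(timestamp_to_datetime(86400000).isoformat())
```

Note that the MinTime and MaxTime fields of QuerySpec are documented in seconds, not milliseconds, so the two kinds of timestamps should not be mixed without conversion.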
struct SourceInfo
{
integer Source;
SourceConfiguration Configuration;
integer Magic;
boolean ReadOnly;
}
Used to return basic information about a source. It has the following fields:
integer Source - ID of the source.
SourceConfiguration Configuration - Configuration of the source.
integer Magic - Magic number to use in SetSourceConfiguration.
boolean ReadOnly - True if the user cannot change the configuration.
struct ArchiveInfo
{
integer Archive;
ArchiveConfiguration Configuration;
integer Magic;
boolean ReadOnly;
boolean Crawling;
integer Documents;
integer Errors;
integer LastRunStart;
integer LastRunEnd;
integer NextRun;
}
Used to return basic information about an archive. It has the following fields:
integer Archive - ID of the archive.
ArchiveConfiguration Configuration - Configuration of the archive.
integer Magic - Magic number to use in SetArchiveConfiguration.
boolean ReadOnly - True if the user cannot change the configuration.
boolean Crawling - TRUE if crawling is in progress.
integer Documents - Number of documents crawled in the current crawl (if crawling is in progress) or in the last completed crawl.
integer Errors - Number of crawl errors in the current crawl (if crawling is in progress) or in the last completed crawl.
integer LastRunStart - Start timestamp of the last crawl (in milliseconds since January 1st, 1970 GMT), or 0 if no crawl has been done yet.
integer LastRunEnd - End timestamp of the last crawl (in milliseconds since January 1st, 1970 GMT), or 0 if no crawl has been done yet.
integer NextRun - Timestamp of the next scheduled crawl (in milliseconds since January 1st, 1970 GMT), or 0 if no crawl is scheduled.
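The three timestamp fields are plain millisecond counts, so a client will usually convert them before displaying them. The following is a minimal Python sketch, assuming the values have already been read from an ArchiveInfo structure. The helper name run_time and the sample values are hypothetical and not part of the searchbox interface.

```python
from datetime import datetime, timezone
from typing import Optional


def run_time(millis: int) -> Optional[datetime]:
    """Convert a LastRunStart, LastRunEnd or NextRun value to a UTC datetime.

    Returns None for 0, which the interface uses to mean
    "no crawl has been done yet" or "no crawl is scheduled".
    """
    if millis == 0:
        return None
    seconds, ms = divmod(millis, 1000)
    # Build the datetime from whole seconds, then add the millisecond part
    # explicitly to avoid floating-point rounding.
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=ms * 1000
    )


# Sample values, made up for illustration.
start = run_time(1_000_000_000_000)
end = run_time(1_000_000_090_500)
print(start.isoformat())
print("duration:", end - start)
```

Checking for None before formatting keeps the "never crawled" and "not scheduled" cases distinct from a real timestamp.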
struct CollectionInfo
{
integer Collection;
CollectionConfiguration Configuration;
integer Magic;
boolean ReadOnly;
}
Used to return basic information about a collection. It has the following fields:
integer Collection - ID of the collection.
CollectionConfiguration Configuration - Configuration of the collection.
integer Magic - Magic number to use in SetCollectionConfiguration.
boolean ReadOnly - TRUE if the user cannot change the configuration.
struct WatchInfo
{
integer Watch;
WatchConfiguration Configuration;
integer Magic;
boolean ReadOnly;
}
Used to return basic information about a watch. It has the following fields:
integer Watch - ID of the watch.
WatchConfiguration Configuration - Configuration of the watch.
integer Magic - Magic number to use in SetWatchConfiguration.
boolean ReadOnly - TRUE if the user cannot change the configuration.
struct MetadataTemplateInfo
{
integer MetadataTemplate;
MetadataTemplateConfiguration Configuration;
integer Magic;
boolean ReadOnly;
}
Used to return basic information about a metadata template. It has the following fields:
integer MetadataTemplate - ID of the metadata template.
MetadataTemplateConfiguration Configuration - Configuration of the metadata template.
integer Magic - Magic number to use in SetMetadataTemplateConfiguration.
boolean ReadOnly - TRUE if the user cannot change the configuration.
enum PageFormat
{
ORIGINAL,
PARSED,
HTML
}
Used to specify the document format to get from the cache. It can have the following values:
ORIGINAL - The original document is returned.
PARSED - The parsed document, in XML format, is returned.
HTML - The parsed document, in HTML format, is returned.
enum Status
{
RUNNING,
CRAWLSTOPPED,
LOWDISKSPACE
}
Used to encode the status of the platform. It can have the following values:
RUNNING - The platform is running normally.
CRAWLSTOPPED - The platform is running, but due to the global crawl control the crawl of new documents is halted.
LOWDISKSPACE - The platform is running, but due to a low disk space condition the crawl of new documents is halted.
struct ExtendedStatus
{
string Key;
string Value;
}
Used to pass the extended status of the platform. It has the following fields:
string Key - The platform parameter name.
string Value - The platform parameter value.
struct CrawlLog
{
integer Time;
string Cookie;
}
Used to return basic information about a crawl log. It has the following fields:
integer Time - Start time of the crawl.
string Cookie - Cookie used to retrieve the log. Takes the value "-1" when the cookie is no longer valid (i.e. the log has been fully read, or the log no longer exists). You must not change this value.
enum CrawlError
{
NONE,
REDIRECT,
INDEX,
NOTFOUND,
UNPARSABLE,
BADREDIR,
BLOCKED,
NETWORK,
UNKNOWN,
UNCHANGED,
OUTOFDOCS,
AUTHREQ,
INTERNAL,
SERVER,
PLUGIN_ERROR,
UNFETCHABLE,
WORKER_FAILURE,
WORKER_TIMEOUT
}
Used to encode the error status of a crawl log entry. It can have the following values:
NONE - No error, the document was fetched successfully.
REDIRECT - A redirect was found.
INDEX - An index was found.
NOTFOUND - The document was not found.
UNPARSABLE - The document was not parsable by the rendering engine.
BADREDIR - A bad redirect was found.
BLOCKED - The document was blocked by robots.txt exclusion rules.
NETWORK - A network error occurred.
UNKNOWN - An unknown error occurred.
UNCHANGED - The document did not change between consecutive fetches.
OUTOFDOCS - The document limit imposed by the activation key was reached.
AUTHREQ - The document was not accessible; authentication information must be provided to the server.
INTERNAL - An internal searchbox error occurred.
SERVER - A generic error was reported by the remote server.
PLUGIN_ERROR - A plugin reported an error.
UNFETCHABLE - The protocol required to fetch the document is unknown to searchbox and no plugin supports it.
WORKER_FAILURE - An internal error occurred in the searchbox document processor.
WORKER_TIMEOUT - searchbox exceeded the time allotted to fetch and process the document. This can be due to a very large document, a slow network connection or an internal searchbox error.
struct CrawlLogData
{
integer Time;
string ID;
string Url;
CrawlError Error;
string Description;
}
Used to return information about a crawl log entry. It has the following fields:
integer Time - Fetch time of the entry.
string ID - ID of the entry.
string Url - URL of the entry.
CrawlError Error - Error status of the entry.
string Description - Detailed description for the error.
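As an illustration of how the Error field might be used on the client side, the following Python sketch counts crawl log entries per CrawlError value. It assumes the entries have already been retrieved and that enumeration values arrive as their names in string form, which depends on the SOAP toolkit in use. The local CrawlLogData class, the summarize helper and the sample entries are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CrawlLogData:
    """Local mirror of the CrawlLogData structure (illustration only)."""
    Time: int
    ID: str
    Url: str
    Error: str          # a CrawlError name such as "NONE" or "NOTFOUND" (assumed encoding)
    Description: str


def summarize(entries: Iterable[CrawlLogData]) -> Counter:
    """Count log entries per CrawlError value."""
    return Counter(entry.Error for entry in entries)


# Sample entries, invented for the example.
entries = [
    CrawlLogData(0, "1", "http://example.com/", "NONE", ""),
    CrawlLogData(0, "2", "http://example.com/missing", "NOTFOUND", "not found"),
    CrawlLogData(0, "3", "http://example.com/gone", "NOTFOUND", "not found"),
]
counts = summarize(entries)
for error, count in counts.most_common():
    print(error, count)
```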
void AddSource(cfg, info);
in SourceConfiguration cfg;
out SourceInfo info;
Adds a new source described by the cfg configuration, and returns the SourceInfo structure in the info parameter.
void GetSourceConfiguration(id, cfg, magic);
in integer id;
out SourceConfiguration cfg;
out integer magic;
Returns the configuration of the source identified by id. The integer magic must be passed to SetSourceConfiguration (see below) when you want to modify this configuration, in order to avoid concurrent changes.
void SetSourceConfiguration(id, cfg, magic, info);
in integer id;
in SourceConfiguration cfg;
in integer magic;
out SourceInfo info;
Replaces the current configuration of the source identified by id with the configuration cfg. The integer magic must be obtained by a call to GetSourceConfiguration for the same source. Returns the SourceInfo structure in the info parameter.
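The magic number appears to act as a token that guards against concurrent changes: read the configuration, modify it, then write it back with the magic value from that same read. The Python sketch below illustrates this read-modify-write cycle against an in-memory stand-in. Everything in it is an assumption made for illustration: the FakeSourceService class, the mapping of out parameters to return values (this depends on the SOAP toolkit), the ExampleField configuration field, and the behaviour on a stale magic number (a RuntimeError here; the actual fault returned by searchbox is not described in this section).

```python
from typing import Any, Callable, Dict, Tuple


class FakeSourceService:
    """In-memory stand-in for the SOAP endpoint (illustration only).

    ASSUMPTION: a write with a stale magic number is rejected and the
    magic number changes after every successful write.
    """

    def __init__(self) -> None:
        self._cfg: Dict[int, Dict[str, Any]] = {1: {"ExampleField": "old"}}
        self._magic: Dict[int, int] = {1: 100}

    def GetSourceConfiguration(self, id: int) -> Tuple[Dict[str, Any], int]:
        # Out parameters (cfg, magic) are modelled as a returned tuple.
        return dict(self._cfg[id]), self._magic[id]

    def SetSourceConfiguration(self, id: int, cfg: Dict[str, Any],
                               magic: int) -> Dict[str, Any]:
        if magic != self._magic[id]:
            raise RuntimeError("stale magic number")
        self._cfg[id] = dict(cfg)
        self._magic[id] += 1
        # The out parameter info is modelled as a SourceInfo-like dict.
        return {"Source": id, "Configuration": dict(cfg),
                "Magic": self._magic[id], "ReadOnly": False}


def update_source(service: Any, source_id: int,
                  change: Callable[[Dict[str, Any]], None]) -> Dict[str, Any]:
    """Read the configuration, apply `change`, and write it back."""
    cfg, magic = service.GetSourceConfiguration(source_id)
    change(cfg)
    return service.SetSourceConfiguration(source_id, cfg, magic)


service = FakeSourceService()
info = update_source(service, 1, lambda cfg: cfg.update(ExampleField="new"))
print(info["Configuration"], info["Magic"])
```

If the write is rejected because another client changed the source in the meantime, a caller would typically repeat the whole cycle, starting with a fresh GetSourceConfiguration call.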
void EnumSourceConfigurations(infos);
out SourceInfo[] infos;
Returns information about all the configured sources.
void AddAuth(cfg, id);
in AuthConfiguration cfg;
out integer id;
Adds a new authentication described by the cfg configuration, and returns the ID of the new authentication in the id parameter.
void GetAuthConfiguration(id, cfg, magic);
in integer id;
out AuthConfiguration cfg;
out integer magic;
Returns the configuration of the authentication identified by id. The integer magic must be passed to SetAuthConfiguration (see below) when you want to modify this configuration, in order to avoid concurrent changes.
void SetAuthConfiguration(id, cfg, magic);
in integer id;
in AuthConfiguration cfg;
in integer magic;
Replaces the current configuration of the authentication identified by id with the configuration cfg. The integer magic must be obtained by a call to GetAuthConfiguration for the same authentication.
void AddArchive(cfg, info);
in ArchiveConfiguration cfg;
out ArchiveInfo info;
Adds a new archive described by the cfg configuration, and returns the ArchiveInfo structure in the info parameter.
void GetArchiveConfiguration(id, cfg, magic);
in integer id;
out ArchiveConfiguration cfg;
out integer magic;
Returns the configuration of the archive identified by id. The integer magic must be passed to SetArchiveConfiguration (see below) when you want to modify this configuration, in order to avoid concurrent changes.
void SetArchiveConfiguration(id, cfg, magic, info);
in integer id;
in ArchiveConfiguration cfg;
in integer magic;
out ArchiveInfo info;
Replaces the current configuration of the archive identified by id with the configuration cfg. The integer magic must be obtained by a call to GetArchiveConfiguration for the same archive. Returns the ArchiveInfo structure in the info parameter.
void EnumArchiveConfigurations(infos);
out ArchiveInfo[] infos;
Returns information about all the configured archives.
void GetArchiveInfo(id, info);
in integer id;
out ArchiveInfo info;
Returns information about the configured archive identified by id.
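One possible use of GetArchiveInfo is to poll an archive until its Crawling field becomes FALSE and then read the Documents and Errors counters. The Python sketch below shows such a loop against an in-memory stand-in. The wait_for_crawl helper, the FakeArchiveService class, the dict representation of ArchiveInfo and the polling parameters are all assumptions made for illustration; a real client would call the SOAP endpoint through its toolkit instead.

```python
import time
from typing import Any, Dict


def wait_for_crawl(service: Any, archive_id: int, interval: float = 5.0,
                   max_polls: int = 120) -> Dict[str, Any]:
    """Poll GetArchiveInfo until Crawling is false or max_polls is reached."""
    for _ in range(max_polls):
        info = service.GetArchiveInfo(archive_id)
        if not info["Crawling"]:
            return info
        time.sleep(interval)
    raise TimeoutError(
        f"archive {archive_id} still crawling after {max_polls} polls"
    )


class FakeArchiveService:
    """Stand-in that reports two in-progress polls, then a finished crawl."""

    def __init__(self) -> None:
        self._polls = 0

    def GetArchiveInfo(self, id: int) -> Dict[str, Any]:
        self._polls += 1
        return {
            "Archive": id,
            "Crawling": self._polls < 3,
            "Documents": 10 * self._polls,
            "Errors": 0,
        }


# interval=0 keeps the demonstration instantaneous.
info = wait_for_crawl(FakeArchiveService(), archive_id=7, interval=0)
print(info["Documents"], info["Errors"])
```

Bounding the number of polls avoids waiting forever if a crawl never finishes; the interval should be chosen so that polling does not load the server unnecessarily.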
void AddCollection(cfg, info);
in CollectionConfiguration cfg;
out CollectionInfo info;
Adds a new collection described by the cfg configuration, and returns the CollectionInfo structure in the info parameter.
void RemoveCollection(id);
in integer id;
Removes the collection with the specified ID.