This chapter describes in detail all use cases regarding searchbox publishing issues. The involved basic concepts are:
Collections
Watches
It also describes the searchbox Enterprise Search Portal, the embedded web-based application for browsing Collections that exploits the searchbox metadata grouping features.
The aim of a Collection is to aggregate many Archives in order to create a unique point of query for them.
Select the Collection Tab of the left pane.
If your searchbox has no user-configured Collections, clicking the Collection Tab on the left panel shows only the default All Archive Collection (id = -1).
None of the options of the Info Tab are editable, because this is the default Collection that aggregates all configured Archives.
Clicking the button at the bottom of
the left panel creates a new Collection with the
default name (new). The basic configuration of
a new Collection consists of setting a Name and a
Description and toggling on the Archives you want to
aggregate into it.
The query on a Collection is exactly the same as that on the Archive.
The only difference is the button. It creates a Watch with the current query as filter (see Creating a new Watch section).
A searchbox query is made up of words and metadata separated by operators. Words are simply English (or any other language) words searchbox should look for in the documents; they appear as simple words in the query, as shown in Example 11.1, “A query containing a single word”.
Example 11.1. A query containing a single word
searchbox
This query will return all the documents containing the word
searchbox.
Moreover, as detailed in Example 11.2, “Some queries using wildcards” you can use
wildcards when querying for words. Searchbox
supports two wildcards: * (star or asterisk) and
? (question mark). The question mark means “any
single character is allowed here” while the asterisk means “any sequence
of any number of characters (including no characters at all) is allowed
here”.
Example 11.2. Some queries using wildcards
search*
This query will return all the documents containing words
starting with search, such as
search, searching or
searchbox.
search??
This query will return all the documents containing words
starting with search followed by exactly two
characters, such as searches or
searched.
searchbox allows you to place a wildcard nearly anywhere in a word and not only at the end, as detailed in Example 11.3, “Some complex wildcards queries”; the only limitation is that the very first character of each word must be a real character and not a wildcard. Moreover you can use multiple wildcards in a word.
Using wildcards near the beginning of a word can be highly inefficient both in terms of query response time and resources exacted on searchbox by the query itself. In some extreme cases searchbox might not be able to perform your query at all.
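The wildcard semantics described above can be illustrated with a short Python sketch. It is only a model of the matching rules stated in this chapter, not the way searchbox evaluates wildcards internally; the function names wildcard_to_regex and word_matches are hypothetical.

```python
import re

def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a searchbox-style wildcard word into a compiled regex."""
    if pattern[:1] in ("*", "?"):
        # The chapter states that the first character must be a real character.
        raise ValueError("a word may not start with a wildcard")
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any sequence of characters, including none
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    # Plain text search is described as case insensitive.
    return re.compile("".join(parts), re.IGNORECASE)

def word_matches(pattern: str, word: str) -> bool:
    """True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None
```

With this model, word_matches("search??", "searched") is true, while word_matches("search??", "search") is false because the two question marks each require one character.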
While words are, well, simple words, metadata are instead made of two parts: a key and a value; the key is the metadata name, so to speak. You can ask for all the documents written in English with the query in Example 11.4, “Querying for specific metadata”.
Example 11.4. Querying for specific metadata[12]
language:en
Returns all the documents labeled with metadata
language with value
en.
You can also query for documents containing metadata whose values belong to a specific range or even perform wildcard searches on the metadata value just as if it were a simple word. This is detailed in Example 11.5, “Selecting documents based on a range of metadata values”.
Example 11.5. Selecting documents based on a range of metadata values [13]
authordate:20050101:20051231
Returns all the documents labeled with metadata
authordate with values in the range
20050101-20051231
inclusive.
authordate:20050101:
Returns all the documents labeled with metadata
authordate authored since and including January
1st, 2005.
authordate::20050101
Returns all the documents labeled with metadata
authordate authored up to and including January
1st, 2005.
authordate::????01*
Returns all the documents authored in January of any year.
Anywhere you can use a simple word in a searchbox query you can also use a metadata.
There are two other important points in metadata queries, both related to the way metadata are indexed. These are discussed in the section called “Some final points on syntax”.
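The range forms shown in Example 11.5 can be modelled with a small Python sketch. It assumes that values are compared as plain strings, which gives the expected ordering for fixed-width YYYYMMDD dates; how searchbox compares values internally is not stated here. The helper names are hypothetical, and escaped colons inside values are not handled.

```python
from typing import Optional, Tuple

def parse_range_query(query: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Split 'key:low:high' into (key, low, high); an empty bound becomes None."""
    key, low, high = query.split(":")
    return key, (low or None), (high or None)

def in_range(value: str, low: Optional[str] = None,
             high: Optional[str] = None) -> bool:
    """Inclusive range test; a bound of None leaves that side open."""
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True
```

For instance, parse_range_query("authordate:20050101:") yields an open upper bound, so in_range accepts every date from January 1st, 2005 onwards.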
As stated above you are not limited to a single word or metadata
in your queries: you can combine them with
operators. The simplest operator is the
phrase search (Example 11.6, “Phrase search”) which lets you search for a “phrase”
(i.e. a sequence of adjacent words) instead of for a single word. Phrase
search is triggered by surrounding words with a pair of double quotes
(").
Example 11.6. Phrase search
"focuseek searchbox features"
Returns all the documents containing the phrase
focuseek searchbox features.
In fact the phrase search has an optional numeric parameter: the
sloppiness. It is specified by following the
closing double quote with a tilde (~) and a non-negative integer, the
so-called sloppiness value. No spaces are allowed
between the quote and the tilde, nor between the tilde and the integer.
Valid values for the sloppiness are in the range from
0 to 2147483648. The
sloppiness is the edit distance, i.e. the maximum
number of word insertions and moves that searchbox will tolerate in the
matched text when looking for the query phrase.
Example 11.7. Phrase search with sloppiness
"focuseek features"~1
Returns all the documents containing the phrase
focuseek features but also all the documents
containing focuseek searchbox features,
focuseek important features as well as any
other document containing the word focuseek
followed by features with exactly one word
between them.
"focuseek features"~2
Returns all the documents mentioned in the previous example and
also documents where the words focuseek and
features are separated by
two intervening words. Moreover it returns all
the documents containing the phrase features
focuseek[14].
Note that a phrase search with a sloppiness value of zero is equivalent to a conventional phrase search.
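The effect of the sloppiness value on a two-word phrase can be approximated with the following Python sketch. It is a simplified model chosen to agree with the examples in this section (one move per intervening word, two moves for swapping adjacent words); it is an assumption, not the searchbox matching algorithm, and the function names are hypothetical.

```python
def two_word_slop(text, first, second):
    """Smallest sloppiness at which '"first second"~n' would match text
    under the simplified model, or None if either word is missing."""
    first, second = first.lower(), second.lower()
    words = text.lower().split()
    first_pos = [i for i, w in enumerate(words) if w == first]
    second_pos = [i for i, w in enumerate(words) if w == second]
    best = None
    for i in first_pos:
        for j in second_pos:
            if j > i:
                cost = j - i - 1      # one move per intervening word
            else:
                cost = (i - j) + 1    # reversed order: an adjacent swap costs 2
            if best is None or cost < best:
                best = cost
    return best

def sloppy_match(text, first, second, slop):
    """True if the two-word phrase matches text within the given sloppiness."""
    cost = two_word_slop(text, first, second)
    return cost is not None and cost <= slop
```

Under this model "focuseek searchbox features" needs a sloppiness of 1, while "features focuseek" needs 2, in line with the swap rule described for the second example.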
You can compose even more complex queries using the so called
boolean operators. There are three boolean
operators in searchbox: & (an ampersand
character, meaning “this word and this other
word”[15] or simply and),
| (a vertical pipe character, meaning “this word
or this other word” or simply
or) and ! (an exclamation
point, meaning “not this word” or simply
not). The first two are infix,
binary operators, meaning that they accept two
arguments and are placed after the first argument and before the second
one, just as the regular algebra multiplication and addition operations.
The not operator instead is
prefix and unary, meaning that
it accepts a single argument and the operator itself comes before the
argument, just as the minus sign used for negative numbers in regular
algebra. The or operator means you are looking for
documents containing either one or both its arguments. The
and operator means you are looking for documents
containing both its arguments. The not operator
means you are looking for documents which don't contain its only
argument. Finally, you can omit blanks surrounding the operators
[16] and the and operator (which is by far
the most commonly used) can be omitted; thus focuseek &
searchbox is the same as focuseek
searchbox[17]. A couple of examples (Example 11.8, “Some simple boolean queries”) will help make things clear.
Example 11.8. Some simple boolean queries
searchbox & focuseek
Returns the documents containing both the words
searchbox and
focuseek in any order or
position.
searchbox focuseek
Shorthand for searchbox &
focuseek.
searchbox & focuseek & price
Returns the documents containing all the words
searchbox,
focuseek and
price in any order or
position.
searchbox focuseek price
Shorthand for searchbox & focuseek &
price.
searchbox | focuseek
Returns the documents containing either or both the words
searchbox and
focuseek in any order or
position.
searchbox | focuseek | search
Returns the documents containing any one of the words
searchbox,
focuseek or
search in any order or
position.
!searchbox
Returns the documents that don't contain the word
searchbox.
focuseek & !searchbox
Returns the documents that contain the word
focuseek and don't contain the word
searchbox.
"focuseek searchbox" | "search tools"
Returns the documents that contain either or both the phrase
focuseek searchbox and the phrase
search tools.
focuseek & authordate:20050101:
Returns the documents that contain the word
focuseek authored since January the
1st, 2005.
The not operator is usually costlier than or, and or is usually costlier than and. Moreover, a query that uses only a not operator may return a very large number of results, as the excluded word might appear in only a tiny fraction of your documents.
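The meaning of the three boolean operators can be pictured as set operations on the documents containing each word. The Python sketch below uses a hypothetical toy index (the words, document ids and helper names are made up for illustration) and says nothing about how searchbox actually evaluates queries.

```python
# Hypothetical toy index: word -> ids of the documents containing it.
INDEX = {
    "searchbox": {1, 2, 3},
    "focuseek": {2, 3, 4},
    "price": {3, 5},
}
ALL_DOCS = {1, 2, 3, 4, 5}

def word(w):
    """Documents containing the word (empty set if the word is unknown)."""
    return INDEX.get(w, set())

def and_(a, b):
    """a & b: documents present in both arguments."""
    return a & b

def or_(a, b):
    """a | b: documents present in either argument, or in both."""
    return a | b

def not_(a):
    """!a: every document that is not in the argument."""
    return ALL_DOCS - a
```

In this model focuseek & !searchbox is and_(word("focuseek"), not_(word("searchbox"))). A lone not_ has to start from the whole document set, which is why such queries tend to return many results.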
In searchbox queries parentheses have the same meaning they have in regular algebra: they group things and help specify what to do first. You can surround any word or any operator and its arguments with parentheses: the whole parenthesized expression will be treated as if it were a single word.
Example 11.9. Using parentheses
(focuseek & searchbox) | (search & tools)
Returns the documents containing both the words
focuseek and
searchbox or both the words
search and
tools.
(focuseek | "search tool") & price
Returns the documents containing the word
focuseek or the phrase
search tool that also contain the
word price.
((focuseek | "search tool") & price) & authordate:20050101:
Returns the documents containing the word
focuseek or the phrase
search tool that also contain the
word price and which were authored since January 1st, 2005.
You might occasionally need to search for a character searchbox
uses in its query syntax (e.g. & or
:). This most often occurs with metadata values
as searchbox, as shipped, doesn't allow these characters in words. In
order to do this you can prefix the character with a single
\ (backslash character).
Should you need to search for a backslash character, simply insert two
consecutive backslash characters.
Example 11.10. Using special characters in the query
pubdate:2005\:01\:01:2006\:12\:31
Search for all the documents marked with the metadata key
pubdate and with values in the range
2005:01:01 up to
2006:12:31 including the
boundaries.
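When queries are built by a program, the escaping rule above can be applied with a small helper. The Python sketch below is only an illustration: the set of characters treated as special is an assumption inferred from the syntax described in this chapter, and the function names are hypothetical.

```python
# Assumption: characters that play a role in the query syntax described
# in this chapter (backslash, operators, parentheses, colon, double quote,
# wildcards, tilde and blank). The real list may differ.
SPECIAL_CHARS = set('\\&|!():"*?~ ')

def escape_value(value: str) -> str:
    """Prefix every special character with a single backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in value)

def range_query(key: str, low: str, high: str) -> str:
    """Build a 'key:low:high' range query with escaped bounds."""
    return f"{key}:{escape_value(low)}:{escape_value(high)}"
```

Calling range_query("pubdate", "2005:01:01", "2006:12:31") produces the query shown in Example 11.10.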
Now we get back to metadata queries. As described in the section called “Gathering "side metadata"” metadata passed to searchbox can be marked to be indexed exactly or to be normalized, tokenized or both. This influences your queries in two ways.
First, while "normal" text search in searchbox is case
insensitive, metadata search is always case
sensitive, both for key and value; thus
authordate:20050101 and
Authordate:20050101 return different
results. However you can tell searchbox that you want to index a
specific metadata in normalized form, meaning
searchbox will store the metadata value in the same way it stores normal
text[18]. Thus if you store normalized metadata you can query them
using lowercase values.
Second, but more important, metadata values are
not split into words unless you specify to index
the metadata as tokenized. Thus, if your document
is marked by the metadata key:val1 val2
(i.e. a single metadata with a value containing a space) queries for
key:val1 or
key:val2 will return no results. See
also Example 11.11, “Querying tokenized and normalized metadata”.
Example 11.11. Querying tokenized and normalized metadata
In this example we will assume searchbox contains the following documents (in pseudo-FFF format):
<fff id="docA">
  <!-- A lot of stuff is omitted here -->
  <meta key="key1">value1 value2</meta>
  <meta key="key2" normalized="1">value1</meta>
</fff>
<fff id="docB">
  <!-- A lot of stuff is omitted here -->
  <meta key="keyt1" tokenized="1">value1 value2</meta>
  <meta key="key2" normalized="1">VALUE1</meta>
</fff>
<fff id="docC">
  <!-- A lot of stuff is omitted here -->
  <meta key="key1">value0 value1 value2 value3</meta>
  <meta key="key2">value1</meta>
</fff>
<fff id="docD">
  <!-- A lot of stuff is omitted here -->
  <meta key="keyt1" tokenized="1">value0 value1 value11 value2</meta>
  <meta key="key2">VALUE1</meta>
</fff>
Let's go on and look at some queries:
key1:value1
Returns docB and
docD: it looks for documents
containing the "word" value1 in
metadata key1.
key1:value1 key1:value2
Returns docB and
docD: it looks for documents
containing both the "words" value1
and value2 in metadata
key1.
"key1:value1 key1:value2"
Returns only docB: looks for
documents containing a phrase made up exactly of the "words"
value1 and
value2 in metadata
key1.
key1:value1\ value2
Returns only docA: looks for
documents containing the single "word"
value1 value2 (note the blank
separating value1 and
value2) in metadata
key1.
key1:value1\ value11
Returns nothing: no document contains the "word"
value1 value11 in metadata
key1.
docD contains the "word"
value0 value1 value11 value2 but it
is not the same exact word we are looking for.
"key1:value0 value2"
Returns nothing: no document contains the exact "phrase"
value0 value2 in metadata
key1.
docD contains these "words" but they
are interspersed with other words.
key2:VALUE1
Returns only docD.
key2:value1
Returns docA,
docB and
docC.
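The effect of the normalized and tokenized flags can be modelled with a few lines of Python. This is a toy model under two assumptions taken from the text: normalization is treated as plain lowercasing (footnote 18 says this is only approximately what happens) and tokenization as splitting on blanks. The names index_terms, KEY2 and query_key2 are hypothetical.

```python
def index_terms(value, normalized=False, tokenized=False):
    """Terms under which a metadata value can be found in this toy model."""
    if normalized:
        value = value.lower()          # assumption: normalization ~ lowercase
    return value.split() if tokenized else [value]

# key2 values of the four example documents: (value, normalized flag).
KEY2 = {
    "docA": ("value1", True),
    "docB": ("VALUE1", True),
    "docC": ("value1", False),
    "docD": ("VALUE1", False),
}

def query_key2(term):
    """Documents whose key2 metadata matches the term exactly (case sensitive)."""
    return sorted(doc for doc, (value, norm) in KEY2.items()
                  if term in index_terms(value, normalized=norm))
```

In this model query_key2("VALUE1") finds only docD and query_key2("value1") finds docA, docB and docC, as in the last two queries of the example. Likewise, index_terms("value1 value2") is a single term unless tokenized=True is passed.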
Select the Collection Tab of the left pane.
Clicking the button opens a popup window in which you choose the Collection the new Watch refers to.
Each Watch can be associated with only one Collection at a time. If you need to monitor Archives that belong to different Collections, you have to create a new Collection for this purpose.
In order to complete the basic configuration of a Watch some fields of the Info Tab of the configuration panel must be filled.
In the General box:
Name: the short name of the Watch.
Description: a brief description of the Watch.
Filter: the query the Watch uses to filter the documents of its Collection. The button lets you modify the weights of slices exactly as in the corresponding Browsing Tab of Archives and Collections.
View modified docs: lets you choose whether to view all changed documents or only those whose core text has changed.
Sort: by default the Watch shows results ordered by their timestamp. searchbox 2.2 introduces other sorting methods for Watches: , and . The Default entry is the same as Newer.
In the Freshness box of the Info Tab of the Watch configuration:
Newer than: sets the maximum age a document may have in order to be returned by the Watch.
By default it is set to 7 days.
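The freshness rule amounts to a simple age check on the document timestamp. The Python sketch below is a minimal illustration; the function name is_fresh is hypothetical, and treating a document that is exactly at the limit as still fresh is an assumption.

```python
from datetime import datetime, timedelta

def is_fresh(doc_timestamp: datetime, now: datetime,
             newer_than_days: int = 7) -> bool:
    """True if the document is no older than the configured maximum age."""
    return now - doc_timestamp <= timedelta(days=newer_than_days)
```

With the default of 7 days, a document fetched five days before the check passes the filter, while one fetched nine days before does not.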
The Notification Tab shows all active notificators for the current Watch. Clicking the button located at the bottom opens a new window.
Using this panel it is possible to set up a notification process that warns a user about new documents that are gathered by searchbox and match the Watch filter.
In the General Tab there are the following fields:
Name: the notificator name
Timing: when the notification should be sent
Ignore watch freshness: to override the corresponding watch preference
Query detail: detail level of the query performed by the watch (for SOAP notification only). Possible values are:
. No details.
. Only URL.
. Title too.
. Context too.
. Template metadata too.
. All details.
Notify detail: detail level of the query performed by the watch (all other notifications). Possible values are:
.
.
.
From the Recipients Tab, clicking the button adds a new entry to the current list (initially empty) of recipients for the notificator.
Each recipient must have an Address and a Media by which he or she wants to receive the notification from the Watch. The Address: field must be valid for the chosen media. Possible values for the Media: field are:
. An email to a specified address in the Address: field.
. An IM message to a Jabber account specified in the Address: field.
. A callback to a Web Service. The endpoint must be specified in the Address: field.
To check whether a notificator has been correctly configured, you can test it immediately using the button in the Info Tab of the Watch configuration panel.
Browsing Watch results is very similar to making a query on an Archive or Collection, except that the default filter is automatically applied.
The differences are:
The button combines the content of the text box with the default filter using the AND operator.
With the Freshness: drop-down it is possible to change the age limit of reported documents. This parameter is valid only for the current browsing session.
By default searchbox lets you access Watch results as an RSS stream. In the Notifications Tab of the Watch configuration panel, the button copies the URL of the RSS stream associated with the Watch results to the clipboard, so that it can be pasted into an external RSS aggregator.
The following picture shows the RSS configuration window of Mozilla Thunderbird.
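Besides a desktop aggregator, the stream can also be read by a program. The Python sketch below parses an inline sample document with the standard library; it assumes the Watch stream follows the usual RSS 2.0 layout (channel, item, title, link), which is not confirmed here, and the sample titles and URLs are invented.

```python
import xml.etree.ElementTree as ET

# Invented sample in plain RSS 2.0; a real Watch stream may carry more fields.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Watch</title>
    <item>
      <title>First document</title>
      <link>http://example.com/doc1</link>
    </item>
    <item>
      <title>Second document</title>
      <link>http://example.com/doc2</link>
    </item>
  </channel>
</rss>"""

def read_items(feed_xml):
    """Return (title, link) pairs for every item of an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]
```

To read a live stream, the text passed to read_items would be the content downloaded from the URL copied from the Notifications Tab.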
Starting from release 2.2, searchbox embeds the Enterprise Search Portal (ESP), an Ajax application for easily browsing the documents in a Collection using both full text and associated metadata. The application has been implemented to provide a very interactive document search and browsing experience within your preferred browser.
From Wikipedia:
"Ajax, shorthand for Asynchronous JavaScript and XML, is a web development technique for creating interactive web applications. The intent is to make web pages feel more responsive by exchanging small amounts of data with the server behind the scenes, so that the entire web page does not have to be reloaded each time the user requests a change. This is meant to increase the web page's interactivity, speed, and usability."
After a query the ESP appears as shown in the following picture:
ESP is a great tool to browse huge document archives gathered into searchbox Collections. It lets you perform incremental queries and group and sort results by various criteria.
The ESP is accessible from any modern JavaScript-enabled Web browser, such as IE 6, Firefox 1.5 or Safari 2, at the address:
http://<hostname>:<portnumber>
where hostname is
localhost and portnumber is
2200 if you are working on the same machine on which
searchbox is installed.
ESP is accessible both as an anonymous and as an authorized user.
ESP can browse groups of Archives that have previously been organized into Collections from the Control Panel. From the Where drop-down you can choose the Collection on which ESP will perform the query.
ESP shows results while you type into the search box, starting from the fourth character. This feature is active only for single-word queries; if you enter several words, a standard query is performed. The incremental query can also be finalized by pressing Return on the keyboard.
ESP can group query results using specific document metadata. Such metadata are automatically provided by the embedded document analyzers. The list of grouping methods available to the user can vary and can be configured from the Control Panel in the Collection Tab.
Within each group of results ESP can use various sorting criteria. The Relevance criterion sorts results according to the internal searchbox relevance algorithm. Other available methods are Time (sorting by increasing or decreasing timestamp) and MIME (sorting by document format), but more generally ESP can perform grouping by any metadata injected into documents by plugins and/or metadata templates.
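The combination of grouping and per-group sorting can be sketched in Python as follows. The result records, their field names (url, mime, relevance) and the function name are hypothetical and serve only to illustrate the idea; ESP's own data structures are not described in this chapter.

```python
from collections import defaultdict

def group_and_sort(results, group_key, sort_key, descending=True):
    """Group result records by one metadata and sort each group by another."""
    groups = defaultdict(list)
    for record in results:
        # Records lacking the grouping metadata end up in a "(none)" group.
        groups[record.get(group_key, "(none)")].append(record)
    return {
        name: sorted(items, key=lambda r: r[sort_key], reverse=descending)
        for name, items in sorted(groups.items())
    }

# Invented sample data.
RESULTS = [
    {"url": "a", "mime": "text/html", "relevance": 0.4},
    {"url": "b", "mime": "application/pdf", "relevance": 0.9},
    {"url": "c", "mime": "text/html", "relevance": 0.7},
]
```

Calling group_and_sort(RESULTS, "mime", "relevance") gives one group per document format, each ordered from the most to the least relevant record.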
ESP can also return results within a specific time interval. This control lets users quickly choose the desired time interval, starting from the current date. The document timestamp is set by default to the time when the document is fetched from the source.
Each group of results can be expanded or collapsed using the rotating arrow located at the left side of the group header bar. ESP automatically decides to show some groups as collapsed in order to show all available groups in a single page.
[12] This example actually works only if you have enabled the SLR plugin in your searchbox source.
[13] This assumes your documents are labeled with a metadata with
key authordate and a value
encoding their authoring date as YYYYMMDD.
[14] Swapping two words counts as two move operations: one for each word.
[15] Actually the arguments to boolean operators are not limited to words but can be any parenthesized expression. For the sake of simplicity we will use only simple words now and explain the full syntax later.
[16] For clarity we will not omit them in our examples.
[17] But not the same as "focuseek
searchbox", which will insist that both words are
adjacent in the document.
[18] More or less this implies that the metadata value is stored in lowercase. In addition some other manipulations are made to avoid some ambiguities in the way characters can be represented in unicode.