Chapter 11. Publishing

Table of Contents

Creating a new Collection
Making a query on a Collection
The query syntax
Words
Metadata
Query operators
Combining queries using parentheses
Some final points on syntax
Creating a new Watch
Setting Watch freshness
Creating a new notificator for Watch
Browsing Watch results
Viewing Watch results into an RSS aggregator
The searchbox Enterprise Search Portal (ESP)
Accessing to ESP
Make a query
Group results by different criteria
Sort query results
Set a time window for query results
Expand & Collapse groups
Getting more info

This chapter describes in detail the use cases related to publishing with searchbox. The basic concepts involved are Collections and Watches.

It also describes the searchbox Enterprise Search Portal, the embedded web-based application for browsing Collections that exploits the searchbox metadata grouping features.

Creating a new Collection

The aim of a Collection is to aggregate several Archives so that they can all be queried from a single point.

Select the Collection Tab of the left pane.

Figure 11.1. The Collections Tab

If your searchbox has no user-configured Collections, clicking the Collection Tab on the left panel shows only the default All Archive Collection (id = -1).

Figure 11.2. The built-in Collection

None of the options in the Info Tab are editable, because this is the default Collection that aggregates all configured Archives.

Clicking the New... button at the bottom of the left panel creates a new Collection with the default name (new). The basic configuration of a new Collection consists of setting a Name and a Description and toggling on the Archives you want to aggregate into it.

Figure 11.3. A new Collection

Making a query on a Collection

A query on a Collection works exactly like a query on an Archive.

Figure 11.4. Browsing Collection content

The only difference is the Save as watch button. It creates a Watch with the current query as its filter (see the section called “Creating a new Watch”).

The query syntax

Words

A searchbox query is made up of words and metadata separated by operators. Words are simply English (or any other language) words searchbox should look for in the documents; they appear as simple words in the query, as shown in Example 11.1, “A query containing a single word”.

Example 11.1. A query containing a single word

searchbox

This query will return all the documents containing the word searchbox.

Moreover, as detailed in Example 11.2, “Some queries using wildcards” you can use wildcards when querying for words. Searchbox supports two wildcards: * (star or asterisk) and ? (question mark). The question mark means “any single character is allowed here” while the asterisk means “any sequence of any number of characters (including no characters at all) is allowed here”.

Example 11.2. Some queries using wildcards

search*

This query will return all the documents containing words starting with search, such as search, searching or searchbox.

search??

This query will return all the documents containing words starting with search followed by exactly two characters, such as searches or searched.

searchbox allows you to place a wildcard nearly anywhere in a word and not only at the end, as detailed in Example 11.3, “Some complex wildcards queries”; the only limitation is that the very first character of each word must be a real character and not a wildcard. Moreover you can use multiple wildcards in a word.

Note

Using wildcards near the beginning of a word can be highly inefficient both in terms of query response time and resources exacted on searchbox by the query itself. In some extreme cases searchbox might not be able to perform your query at all.

Example 11.3. Some complex wildcards queries

s*r*

Matches the documents containing search, super, superfluous, etc.

s??r*

Matches the documents containing search, searching, supreme, stir, etc.

s*r

Matches super, star, etc.
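The wildcard rules above can be illustrated with a small sketch that translates a query word into a regular expression. This is only a model of the described behaviour, not the way searchbox implements it; the function names and the sample word list are made up for illustration.

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a word pattern using * and ? into an anchored regex."""
    if pattern[:1] in ("*", "?"):
        # The chapter states that the first character must be a real character.
        raise ValueError("pattern must not start with a wildcard")
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any sequence of characters, including none
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    # Text search is described as case insensitive, hence IGNORECASE.
    return re.compile("".join(parts) + r"\Z", re.IGNORECASE)

def matching_words(pattern: str, words: list[str]) -> list[str]:
    """Return the words (hypothetical index terms) matched by the pattern."""
    regex = wildcard_to_regex(pattern)
    return [w for w in words if regex.match(w)]

words = ["search", "searches", "searchbox", "super", "supreme", "stir", "star"]
print(matching_words("search??", words))  # ['searches']
```

Note that in this model star also matches s??r*, since the trailing asterisk may stand for no characters at all.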

Metadata

While words are, well, simple words, metadata are instead made of two parts: a key and a value; the key is, so to speak, the metadata name. You can ask for all the documents written in English with the query in Example 11.4, “Querying for specific metadata”.

Example 11.4. Querying for specific metadata[12]

language:en

Returns all the documents labeled with metadata language with value en.

You can also query for documents containing metadata whose values belong to a specific range, or even perform wildcard searches on the metadata value just as if it were a simple word. This is detailed in Example 11.5, “Selecting documents based on a range of metadata values”.

Example 11.5. Selecting documents based on a range of metadata values [13]

authordate:20050101:20051231

Returns all the documents labeled with metadata authordate with values in the range 20050101-20051231 inclusive.

authordate:20050101:

Returns all the documents labeled with metadata authordate authored since and including January 1st, 2005.

authordate::20050101

Returns all the documents labeled with metadata authordate authored up to and including January 1st, 2005.

authordate::????01*

Returns all the documents authored in January of any year.
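As a rough illustration of the range forms above, the following sketch parses key:value, key:low:high, key:low: and key::high and applies an inclusive comparison. It assumes fixed-width YYYYMMDD values (so that string comparison matches date order) and ignores escaped colons and wildcards; the helper names and the sample documents are hypothetical.

```python
def parse_metadata_query(query: str):
    """Split key:value or key:low:high into (key, low, high).

    Assumption: the query contains no escaped colons and no wildcards.
    """
    parts = query.split(":")
    if len(parts) == 2:            # key:value, treated as a one-value range
        key, value = parts
        return key, value, value
    if len(parts) == 3:            # key:low:high, either bound may be empty
        key, low, high = parts
        return key, low or None, high or None
    raise ValueError(f"unsupported query: {query!r}")

def in_range(value: str, low, high) -> bool:
    """Inclusive comparison; valid for fixed-width strings such as YYYYMMDD."""
    return (low is None or value >= low) and (high is None or value <= high)

def select(query: str, docs: dict) -> list:
    """Return the ids of the documents whose metadata satisfy the query."""
    key, low, high = parse_metadata_query(query)
    return sorted(doc_id for doc_id, meta in docs.items()
                  if key in meta and in_range(meta[key], low, high))

docs = {
    "a": {"authordate": "20041231"},
    "b": {"authordate": "20050101"},
    "c": {"authordate": "20051231"},
    "d": {"authordate": "20060101"},
}
print(select("authordate:20050101:20051231", docs))  # ['b', 'c']
```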

Note

Anywhere you can use a simple word in a searchbox query you can also use a metadata.

There are two other important points in metadata queries, both related to the way metadata are indexed. These are discussed in the section called “Some final points on syntax”.

Query operators

As stated above you are not limited to a single word or metadata in your queries: you can combine them with operators. The simplest operator is the phrase search (Example 11.6, “Phrase search”) which lets you search for a “phrase” (i.e. a sequence of adjacent words) instead of for a single word. Phrase search is triggered by surrounding words with a pair of double quotes (").

Example 11.6. Phrase search

"focuseek searchbox features"

Returns all the documents containing the phrase focuseek searchbox features.

In fact the phrase search has an optional numeric parameter: the sloppiness. It is specified by following the closing double quote with a tilde (~) and a non-negative integer, the so-called sloppiness value. No spaces are allowed between the quote and the tilde nor between the tilde and the integer. Valid values for the sloppiness are in the range from 0 to 2147483648. The sloppiness is the edit distance, i.e. the maximum number of word insertions and moves that searchbox will tolerate in the matched text when looking for the query phrase.

Example 11.7. Phrase search with sloppiness

"focuseek features"~1

Returns all the documents containing the phrase focuseek features but also all the documents containing focuseek searchbox features, focuseek important features as well as any other document containing the word focuseek followed by features with exactly one word between them.

"focuseek features"~2

Returns all the documents mentioned in the previous example and also documents where the words focuseek and features are separated by two intervening words. Moreover it returns all the documents containing the phrase features focuseek[14].

Note that a phrase search with a sloppiness value of zero is equivalent to a conventional phrase search.
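One way to reason about sloppiness is to compute, for a given text, the smallest value at which a phrase would match. The sketch below uses a simple positional model that agrees with the examples above (one intervening word costs 1, swapping two adjacent words costs 2). It is an assumption made for illustration and not necessarily the algorithm searchbox uses; min_slop is a hypothetical helper.

```python
from itertools import product

def min_slop(phrase: list[str], text: list[str]):
    """Smallest sloppiness at which `phrase` matches `text`, or None.

    A phrase word at index i found at text position p gets the offset p - i.
    In an exact phrase all offsets are equal; the spread between the largest
    and smallest offset counts how far the words had to move.
    """
    positions = []
    for word in phrase:
        found = [p for p, w in enumerate(text) if w == word]
        if not found:
            return None          # a phrase word is missing from the text
        positions.append(found)
    best = None
    for combo in product(*positions):
        if len(set(combo)) < len(combo):
            continue             # two phrase words cannot share one position
        offsets = [p - i for i, p in enumerate(combo)]
        spread = max(offsets) - min(offsets)
        best = spread if best is None else min(best, spread)
    return best

phrase = ["focuseek", "features"]
print(min_slop(phrase, "focuseek searchbox features".split()))  # 1
print(min_slop(phrase, "features focuseek".split()))            # 2
```

Under this model a document matches "focuseek features"~N whenever min_slop returns a value not greater than N.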

You can compose even more complex queries using the so-called boolean operators. There are three boolean operators in searchbox: & (an ampersand character, meaning “this word and this other word”[15] or simply and), | (a vertical pipe character, meaning “this word or this other word” or simply or) and ! (an exclamation point, meaning “not this word” or simply not).

The first two are infix, binary operators, meaning that they accept two arguments and are placed after the first argument and before the second one, just like the multiplication and addition operations of regular algebra. The not operator instead is prefix and unary, meaning that it accepts a single argument and the operator itself comes before the argument, just like the minus sign used for negative numbers in regular algebra.

The or operator means you are looking for documents containing either one or both of its arguments. The and operator means you are looking for documents containing both of its arguments. The not operator means you are looking for documents which don't contain its only argument. Finally, you can omit the blanks surrounding the operators[16], and the and operator (which is by far the most commonly used) can be omitted altogether; thus focuseek & searchbox is the same as focuseek searchbox[17]. A couple of examples (Example 11.8, “Some simple boolean queries”) will help make things clear.

Example 11.8. Some simple boolean queries

searchbox & focuseek

Returns the documents containing both the words searchbox and focuseek in any order or position.

searchbox focuseek

Shorthand for searchbox & focuseek.

searchbox & focuseek & price

Returns the documents containing all the words searchbox, focuseek and price in any order or position.

searchbox focuseek price

Shorthand for searchbox & focuseek & price.

searchbox | focuseek

Returns the documents containing either or both the words searchbox and focuseek in any order or position.

searchbox | focuseek | search 

Returns the documents containing any one of the words searchbox, focuseek or search in any order or position.

!searchbox

Returns the documents that don't contain the word searchbox.

focuseek & !searchbox

Returns the documents that contain the word focuseek and don't contain the word searchbox.

"focuseek searchbox" | "search tools"

Returns the documents that contain either or both the phrase focuseek searchbox and the phrase search tools.

focuseek & authordate:20050101:

Returns the documents that contain the word focuseek and were authored since January 1st, 2005.

Note

The not operator is usually costlier than or, and or is usually costlier than and. Moreover, a query consisting only of a not operator may return a very large number of results, as the word you are excluding might appear in only a tiny fraction of your documents.
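The meaning of the three operators can be sketched with plain set operations over document identifiers. The tiny index below is invented for illustration, and the helper names are hypothetical; searchbox itself is of course not implemented this way.

```python
# Hypothetical inverted index: word -> ids of the documents containing it.
index = {
    "searchbox": {1, 2, 3},
    "focuseek": {2, 3, 4},
    "price": {3, 5},
}
all_docs = {1, 2, 3, 4, 5}

def docs_with(word):
    """Documents containing the word (empty set for unknown words)."""
    return index.get(word, set())

def q_and(a, b):
    return a & b            # both arguments must match

def q_or(a, b):
    return a | b            # either or both arguments may match

def q_not(a):
    return all_docs - a     # needs the whole document set, hence its cost

# focuseek & !searchbox
print(sorted(q_and(docs_with("focuseek"), q_not(docs_with("searchbox")))))  # [4]
```

The sketch also shows why a query made only of a not operator tends to return many documents: it is the complement of a usually small set.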

Combining queries using parentheses

In searchbox queries parentheses have the same meaning they have in regular algebra: they group things and help specify what to do first. You can surround any word, or any operator together with its arguments, with parentheses: the whole parenthesized expression will be treated as if it were a single word.

Example 11.9. Using parentheses

(focuseek & searchbox) | (search & tools)

Returns the documents containing both the words focuseek and searchbox or both the words search and tools.

(focuseek | "search tool") & price

Returns the documents containing the word focuseek or the phrase search tool that also contain the word price.

((focuseek | "search tool") & price) & authordate:20050101:

Returns the documents containing the word focuseek or the phrase search tool that also contain the word price and which were authored since January 1st, 2005.

Some final points on syntax

You might occasionally need to search for a character that searchbox uses in its query syntax (e.g. & or :). This most often occurs with metadata values, since searchbox, as shipped, doesn't allow these characters in words. To do this, prefix the character with a single \ (backslash character). Should you need to search for a backslash character, simply insert two consecutive backslash characters.

Example 11.10. Using special characters in the query

pubdate:2005\:01\:01:2006\:12\:31

Search for all the documents marked with the metadata key pubdate and with values in the range 2005:01:01 up to 2006:12:31 including the boundaries.
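When queries are built by a program, the escaping rule can be applied automatically. The sketch below is a minimal example: the SPECIAL set is an assumption assembled from the syntax characters used in this chapter (the text itself only names &, : and the backslash as examples), and both helper functions are hypothetical.

```python
# Assumed set of characters to escape: backslash, &, |, !, :, parentheses,
# double quote and blank. Adjust it to your needs.
SPECIAL = set('\\&|!:()" ')

def escape_value(value: str) -> str:
    """Prefix every special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in value)

def range_query(key: str, low: str, high: str) -> str:
    """Build a key:low:high query with both bounds escaped."""
    return f"{key}:{escape_value(low)}:{escape_value(high)}"

print(range_query("pubdate", "2005:01:01", "2006:12:31"))
# pubdate:2005\:01\:01:2006\:12\:31
```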

Now we get back to metadata queries. As described in the section called “Gathering "side metadata"” metadata passed to searchbox can be marked to be indexed exactly or to be normalized, tokenized or both. This influences your queries in two ways.

First, while "normal" text search in searchbox is case insensitive, metadata search is always case sensitive, both for key and value; thus authordate:20050101 and Authordate:20050101 return different results. However you can tell searchbox that you want to index a specific metadata in normalized form, meaning searchbox will store the metadata value in the same way it stores normal text[18]. Thus if you store normalized metadata you can query them using lowercase values.

Second, but more important, metadata values are not split into words unless you specify to index the metadata as tokenized. Thus, if your document is marked by the metadata key:val1 val2 (i.e. a single metadata with a value containing a space) queries for key:val1 or key:val2 will return no results. See also Example 11.11, “Querying tokenized and normalized metadata”.

Example 11.11. Querying tokenized and normalized metadata

In this example we will assume searchbox contains the following documents (in pseudo-FFF format):

<fff id="docA">
  <!-- A lot of stuff is omitted here -->
  <meta key="key1">value1 value2</meta>
  <meta key="key2" normalized="1">value1</meta>
</fff>

<fff id="docB">
  <!-- A lot of stuff is omitted here -->
  <meta key="key1" tokenized="1">value1 value2</meta>
  <meta key="key2" normalized="1">VALUE1</meta>
</fff>

<fff id="docC">
  <!-- A lot of stuff is omitted here -->
  <meta key="key1">value0 value1 value2 value3</meta>
  <meta key="key2">value1</meta>
</fff>

<fff id="docD">
  <!-- A lot of stuff is omitted here -->
  <meta key="key1" tokenized="1">value0 value1 value11 value2</meta>
  <meta key="key2">VALUE1</meta>
</fff>

Let's go on and look at some queries:

key1:value1

Returns docB and docD: it looks for documents containing the "word" value1 in metadata key1.

key1:value1 key1:value2

Returns docB and docD: it looks for documents containing both the "words" value1 and value2 in metadata key1.

"key1:value1 key1:value2"

Returns only docB: looks for documents containing a phrase made up exactly of the "words" value1 and value2 in metadata key1.

key1:value1\ value2

Returns only docA: looks for documents containing the single "word" value1 value2 (note the blank separating value1 and value2) in metadata key1.

key1:value1\ value11

Returns nothing: no document contains the single "word" value1 value11 in metadata key1. docD contains both value1 and value11, but as separate tokenized "words" and not as the single word we are looking for.

"key1:value0 value2"

Returns nothing: no document contains the exact "phrase" value0 value2 in metadata key1. docD contains these "words" but they are interspersed with other words.

key2:VALUE1

Returns only docD.

key2:value1

Returns docA, docB and docC.
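The single-word results above can be reproduced with a small model of how exact, tokenized and normalized metadata might be indexed. It assumes that the tokenized metadata of docB and docD use the key key1, as the queries imply, and it approximates normalization with simple lowercasing; the data layout and function names are hypothetical.

```python
# (key, value, flags) triples mirroring the four sample documents.
DOCS = {
    "docA": [("key1", "value1 value2", ()),
             ("key2", "value1", ("normalized",))],
    "docB": [("key1", "value1 value2", ("tokenized",)),
             ("key2", "VALUE1", ("normalized",))],
    "docC": [("key1", "value0 value1 value2 value3", ()),
             ("key2", "value1", ())],
    "docD": [("key1", "value0 value1 value11 value2", ("tokenized",)),
             ("key2", "VALUE1", ())],
}

def indexed_words(value, flags):
    """Return the "words" stored in the index for one metadata value."""
    if "normalized" in flags:
        value = value.lower()        # rough stand-in for normalization
    # Without tokenization the whole value is one single "word".
    return value.split() if "tokenized" in flags else [value]

def search(key, word):
    """Documents whose metadata `key` contains exactly the given "word"."""
    return sorted(doc for doc, metas in DOCS.items()
                  if any(k == key and word in indexed_words(v, f)
                         for k, v, f in metas))

print(search("key1", "value1"))          # ['docB', 'docD']
print(search("key1", "value1 value2"))   # ['docA']
```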

Creating a new Watch

Select the Watches Tab of the left pane.

Figure 11.5. The Watches Tab

Clicking the New button opens a popup window where you choose the Collection that the new Watch will refer to.

Figure 11.6. Choose the Collection to use with a Watch

Each Watch can be associated with only one Collection at a time. If you need to monitor Archives that belong to different Collections, you have to create a new Collection for this purpose.

To complete the basic configuration of a Watch, some fields in the Info Tab of the configuration panel must be filled in.

Figure 11.7. The Info Tab of Watch configuration panel

In the General box:

Name: the short name of the Watch.

Description: a brief description of the Watch.

Filter: the query that the Watch uses to filter the documents of its corresponding Collection. The Weight... button lets you modify the weights of slices exactly as in the corresponding Browsing Tab of Archives and Collections.

View modified docs: lets you choose whether to view all changed documents or only those whose core text has changed.

Sort: by default the Watch shows results ordered by their timestamp. searchbox 2.2 introduces other sorting methods for Watches: Score, Newer and Older. The Default entry is the same as Newer.

Setting Watch freshness

In the Freshness box of the Info Tab of the Watch configuration:

Newer than: sets the maximum age a document may have in order to be returned by the Watch.

By default it is set to 7 days.
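The effect of the Newer than setting can be pictured as a cutoff on document timestamps. The sketch below only illustrates the idea: the function name and the sample data are made up, and treating the cutoff as inclusive is an assumption.

```python
from datetime import datetime, timedelta

def fresh_documents(docs, now, newer_than=timedelta(days=7)):
    """Keep the documents whose timestamp is not older than `newer_than`."""
    cutoff = now - newer_than
    return [doc_id for doc_id, timestamp in docs if timestamp >= cutoff]

now = datetime(2006, 3, 15, 12, 0)
docs = [
    ("old", datetime(2006, 3, 1)),
    ("recent", datetime(2006, 3, 10)),
    ("today", datetime(2006, 3, 15, 9, 0)),
]
print(fresh_documents(docs, now))  # ['recent', 'today']
```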

Creating a new notificator for Watch

The Notification Tab shows all active notificators for the current Watch. Clicking the Add... button located at the bottom opens a new window.

Figure 11.8. Watch notification window

This panel lets you set up a notification process that warns a user about new documents that are gathered by searchbox and match the Watch filter.

In the General Tab there are the following fields:

Name: the notificator name

Timing: when the notification should be sent

Ignore watch freshness: to override the corresponding watch preference

Query detail: detail level of the query performed by the watch (for SOAP notification only). Possible values are:

Figure 11.9. SOAP Notification detail level

  • None. No details.

  • URL. Only URL.

  • Title. Title too.

  • Context. Context too.

  • Template metadata. Template metadata too.

  • All metadata. All details.

Notify detail: detail level of the query performed by the watch (all other notifications). Possible values are:

Figure 11.10. Notification detail level

  • No details.

  • Watch info only.

  • Full results.

In the Recipients Tab, clicking the New... button adds a new entry to the current list of recipients for the notificator (the list is initially empty).

Figure 11.11. The Notificator Recipient list

Each recipient must have an Address and a Media through which they want to receive notifications from the Watch. The Address: field must be valid for the chosen media. Possible values for the Media: field are:

  • E-Mail. An email to a specified address in the Address: field.

  • Jabber. An IM message to a Jabber account specified in the Address: field.

  • SOAP. A callback to a Web Service. The endpoint must be specified in the Address: field.

To check that a notificator has been configured correctly, you can trigger it immediately using the Notify now button in the Info Tab of the Watch configuration panel.

Browsing Watch results

Browsing Watch results is very similar to making a query on an Archive or Collection, except that the default filter is automatically applied.

Figure 11.12. Watch Browsing

Differences are:

  • The Refine button combines the content of the text box with the default filter using the and operator.

  • The Freshness: drop-down lets you change the age limit of the reported documents. This parameter is valid only for the current browsing session.

Viewing Watch results into an RSS aggregator

By default searchbox lets you access Watch results as an RSS stream. In the Notifications Tab of the Watch configuration panel, the Copy RSS URL button copies the URL of the RSS stream associated with the Watch results into the clipboard, so that it can be pasted into an external RSS aggregator.

Figure 11.13. The Copy RSS button

The following picture shows the RSS configuration window of Mozilla Thunderbird.

Figure 11.14. The Mozilla Thunderbird RSS configuration
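The RSS stream can also be read by a script instead of an aggregator. The sketch below extracts titles and links from a feed, assuming it follows the usual RSS 2.0 layout (channel, item, title, link); the exact content of the searchbox feed is not documented here, so the sample feed and the rss_items helper are purely hypothetical.

```python
import xml.etree.ElementTree as ET

def rss_items(xml_text: str) -> list[tuple[str, str]]:
    """Return (title, link) pairs for every item element in the feed."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title", default=""),
             item.findtext("link", default=""))
            for item in root.iter("item")]

# Made-up feed standing in for the stream behind the copied RSS URL.
SAMPLE = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Hypothetical Watch</title>
<item><title>First document</title><link>http://example.com/1</link></item>
<item><title>Second document</title><link>http://example.com/2</link></item>
</channel></rss>"""

for title, link in rss_items(SAMPLE):
    print(title, link)
```

To read a real Watch, the text passed to rss_items would be downloaded from the URL copied with the Copy RSS URL button, for example with urllib.request.urlopen.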

The searchbox Enterprise Search Portal (ESP)

Starting from release 2.2, searchbox embeds the Enterprise Search Portal (ESP), an Ajax application for easily browsing the documents in a Collection using both full text and associated metadata. The application has been implemented to provide a highly interactive document search and browsing experience within your preferred browser.

From Wikipedia:

"Ajax, shorthand for Asynchronous JavaScript and XML, is a web development technique for creating interactive web applications. The intent is to make web pages feel more responsive by exchanging small amounts of data with the server behind the scenes, so that the entire web page does not have to be reloaded each time the user requests a change. This is meant to increase the web page's interactivity, speed, and usability."

After a query the ESP appears as shown in the following picture:

Figure 11.15. The Enterprise Search Portal

ESP is a great tool for browsing huge document archives gathered into searchbox Collections. It lets you perform incremental queries and group and sort results by various criteria.

Accessing to ESP

The ESP is accessible from any modern JavaScript-enabled Web browser, such as IE 6, Firefox 1.5 or Safari 2, at the address:

http://<hostname>:<portnumber>

where hostname is localhost and portnumber is 2200 if you are working on the machine where searchbox is installed.

ESP is accessible to both anonymous and authorized users.

Make a query

ESP can browse groups of Archives that have previously been organized into Collections from the Control Panel. From the Where dropdown you can choose the Collection on which ESP will perform the query.

ESP shows results while you type into the search box, starting from the fourth character. This feature is active only for single-word queries; if you enter several words, a standard query is performed. The incremental query can also be finalized by pressing Return on the keyboard.

Group results by different criteria

ESP can group query results using specific document metadata. Such metadata are automatically provided by the embedded document analyzers. The list of grouping methods available to users can vary and can be configured from the Control Panel in the Collection Tab.

Sort query results

Within each group of results ESP can use various sorting criteria. The Relevance criterion sorts results according to the internal searchbox relevance algorithm. Other available methods are Time (sorting by increasing or decreasing timestamp) and MIME (sorting by document format); more generally, ESP can perform grouping by any metadata injected into documents by plugins and/or metadata templates.
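Grouping by one metadata and sorting within each group can be sketched as follows. This only illustrates the idea behind the ESP result view; the field names (mime, score, timestamp), the sample results and the helper function are assumptions made for the example.

```python
from collections import defaultdict

def group_and_sort(results, group_key, sort_key, descending=True):
    """Group results by one metadata, then sort each group by another."""
    groups = defaultdict(list)
    for doc in results:
        # Documents lacking the grouping metadata end up in a "(none)" group.
        groups[doc.get(group_key, "(none)")].append(doc)
    return {name: sorted(docs, key=lambda d: d[sort_key], reverse=descending)
            for name, docs in sorted(groups.items())}

results = [
    {"url": "a.pdf", "mime": "application/pdf", "score": 0.4, "timestamp": 3},
    {"url": "b.html", "mime": "text/html", "score": 0.9, "timestamp": 1},
    {"url": "c.pdf", "mime": "application/pdf", "score": 0.7, "timestamp": 2},
]
grouped = group_and_sort(results, "mime", "score")
print([d["url"] for d in grouped["application/pdf"]])  # ['c.pdf', 'a.pdf']
```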

Set a time window for query results

ESP can also return results within a specific time interval. This control lets users quickly choose the desired time interval, starting from the current date. By default the document timestamp is set to the time when the document was fetched from the source.

Expand & Collapse groups

Each group of results can be expanded or collapsed using the rotating arrow located at the left side of each group header bar. ESP automatically decides to show some groups collapsed so that all available groups fit on a single page.

Getting more info

More detailed information is available for each returned document. Clicking the "i" sign expands the corresponding row, showing the actual MIME type string and the original context in which the search string was found in the document.



[12] This example actually works only if you enabled the SLR plugin in your searchbox source.

[13] This assumes your documents are labeled with a metadata with key authordate and a value encoding their authoring date as YYYYMMDD.

[14] Swapping two words counts as two move operations: one for each word.

[15] Actually the arguments to boolean operators are not limited to words but can be any parenthesized expression. For the sake of simplicity we will use only simple words now and explain the full syntax later.

[16] For clarity we will not omit them in our examples.

[17] But not the same as "focuseek searchbox", which will insist that both words are adjacent in the document.

[18] More or less this implies that the metadata value is stored in lowercase. In addition some other manipulations are made to avoid some ambiguities in the way characters can be represented in unicode.