Chapter 9. Rendering

Table of Contents

Exploiting Metadata
The "Focuseek Flexible Format" (FFF)
On-line Metadata Injection: The Rendering Plugins Pipeline
Configuration of a Rendering Plugin
Activation of a Rendering Plugin

This chapter describes the searchbox rendering process in more detail.

Exploiting Metadata

The overall rendering process consists of converting the original fetched document into an internal XML representation in which all the features extracted from the original document are modeled as metadata. In this context, a feature is any characteristic that can be directly extracted from the document or inferred by analyzing it. The list of words that make up the document is a feature, but so is the layout of the document. More advanced features include, for instance, the semantics associated with portions of the document. The extraction of this kind of feature is delegated to specific software agents embedded in searchbox through the plugin system.

The FFF format, described in the following section and in more detail in the Programmer's Reference Guide, is the internal XML document format used by searchbox.

The "Focuseek Flexible Format" (FFF)

The searchbox 2.2 internal document representation is based on FFF version 1.0. The minimal set of features covered by the FFF 1.0 document format is:

  • Hyperlinks

  • Paragraphs

  • List of words

  • Metadata

The first section of an FFF lists the hyperlinks contained in the document. The attributes of the <link> tag are urlid, url and score: urlid is an identifier assigned to the hyperlink by searchbox, url is the complete URL of the link, and score is a real number between 0 and 1 calculated by default by the rendering process.

The paragraph is the main text unit considered by the FFF. It represents a block of text extracted from the original document, taking its layout elements into account. In the FFF a paragraph is delimited by the <p>...</p> tag.

Each paragraph is characterized by a list of words delimited by the <low>...</low> tag. The <low> tag has three possible attributes: dictid, sliceid and urlid. The first two identify the slice of the index in which the list of words must be stored; the last one is a pointer to a hyperlink defined earlier in a <link> tag.

Here is an example of an FFF created from an HTML page of the focuseek web site:

<?xml version="1.0"?>
<document id="0a.e39e5e64b03ce230d138a51e23a39186" url="http://www.focuseek.com/" timestamp="1119275665" score="1.0">
 <fff>
  <link urlid="0" url="http://www.focuseek.com/index.html" score="0.000123"/>
  <link urlid="1" url="http://www.focuseek.com/info.html" score="0.000123"/>
  <link urlid="2" url="http://www.focuseek.com/buy.html" score="0.000123"/>
  <link urlid="3" url="http://www.focuseek.com/documentation.html" score="0.000123"/>
  <link urlid="4" url="http://www.focuseek.com/downloads.html" score="0.000123"/>
  <link urlid="5" url="http://www.focuseek.com/contacts.html" score="0.000123"/>
  <link urlid="6" url="http://www.focuseek.com/webapps.html" score="0.000123"/>
  <link urlid="7" url="http://www.focuseek.com/enterpriseapps.html" score="0.000123"/>
  <link urlid="8" url="http://www.focuseek.com/scalableapps.html" score="0.000123"/>
  <contents>
   <p><s><low dictid="15" sliceid="title">focuseek</low></s></p>
   <p><s><low dictid="7" sliceid="marginalLink" urlid="0">focuseek</low></s></p>
   <p><s><low dictid="7" sliceid="marginalLink" urlid="0">Home</low></s></p>
   <p><s><low dictid="7" sliceid="marginalLink" urlid="1">Info</low></s></p>
   <p><s><low dictid="7" sliceid="marginalLink" urlid="2">Buy</low></s></p>
   <p><s><low dictid="7" sliceid="marginalLink" urlid="3">Documentation</low></s></p>
   <p><s><low dictid="7" sliceid="marginalLink" urlid="4">Downloads</low></s></p>
   <p><s><low dictid="7" sliceid="marginalLink" urlid="5">Contacts</low></s></p>
   <p><s><low dictid="10" sliceid="centralNorm">searchbox 2 With searchbox a new family of search products is born.</low></s><s><low dictid="10" sliceid="cen
   <p><s><low dictid="14" sliceid="centralHeader">Choose your path</low></s></p>
   <p><s><low dictid="10" sliceid="centralNorm">Are you planning a large-scale intranet search facility?</low></s><s><low dictid="10" sliceid="centralNorm">O
   <p><s><low dictid="12" sliceid="centralLink" urlid="6">Web applications</low></s></p>
   <p><s><low dictid="5" sliceid="marginalNorm">Aggregate and publish Web content in public or corporate portals.</low></s></p>
   <p><s><low dictid="7" sliceid="marginalLink" urlid="7">Enterprise applications</low></s></p>
   <p><s><low dictid="5" sliceid="marginalNorm">Search and monitoring technology for small and medium businesses.</low></s></p>
   <p><s><low dictid="7" sliceid="marginalLink" urlid="8">Scalable applications</low></s></p>
   <p><s><low dictid="10" sliceid="marginalNorm">GRID technology to handle hundreds millions documents and thousands users.</low></s></p>
   <p><s><low dictid="5" sliceid="marginalNorm"> © 2005 focuseek</low></s></p>
  </contents>
  <meta type="documentwide" dictid="2" sliceid="invisible" key="reliable">yes</meta>
  <meta type="documentwide" dictid="2" sliceid="invisible" key="page">home</meta>
 </fff>
</document>

The complete DTD of the FFF 1.0 format follows. Please see the Programmer's Reference Manual for more detailed information.

<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT contents (p+)>
<!ELEMENT document (fff)>
<!ATTLIST document
    id NMTOKEN #REQUIRED
    score NMTOKEN #REQUIRED
    timestamp NMTOKEN #REQUIRED
    url CDATA #REQUIRED
>
<!ELEMENT fff (link+,contents)>
<!ELEMENT link EMPTY>
<!ATTLIST link
    score NMTOKEN #REQUIRED
    url CDATA #REQUIRED
    urlid NMTOKEN #REQUIRED
>
<!ELEMENT low (#PCDATA|p)*>
<!ATTLIST low
    dictid (1|2|3|4|5|6|7|9|10|11|12|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|39|31) #REQUIRED
    sliceid CDATA #REQUIRED
    urlid NMTOKEN #IMPLIED
>
<!ELEMENT p (s+)>
<!ELEMENT s (low)>
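
As an illustration of how the pieces of an FFF relate to each other, the following minimal Python sketch reads a shortened version of the example document with the standard xml.etree.ElementTree module and resolves each urlid on a <low> element to the URL of the matching <link>. The function name summarize_fff and the shape of the returned dictionary are assumptions made for this sketch; they are not part of searchbox or of its programming interface.

```python
import xml.etree.ElementTree as ET

# Shortened FFF document modeled on the example in this section.
FFF_SAMPLE = """<?xml version="1.0"?>
<document id="0a.e39e5e64b03ce230d138a51e23a39186" url="http://www.focuseek.com/" timestamp="1119275665" score="1.0">
 <fff>
  <link urlid="0" url="http://www.focuseek.com/index.html" score="0.000123"/>
  <link urlid="1" url="http://www.focuseek.com/info.html" score="0.000123"/>
  <contents>
   <p><s><low dictid="15" sliceid="title">focuseek</low></s></p>
   <p><s><low dictid="7" sliceid="marginalLink" urlid="0">Home</low></s></p>
   <p><s><low dictid="7" sliceid="marginalLink" urlid="1">Info</low></s></p>
  </contents>
  <meta type="documentwide" dictid="2" sliceid="invisible" key="reliable">yes</meta>
 </fff>
</document>"""


def summarize_fff(xml_text):
    """Hypothetical helper: collect links, word lists and metadata of an FFF."""
    root = ET.fromstring(xml_text)
    fff = root.find("fff")

    # Map each urlid to its complete URL.
    links = {link.get("urlid"): link.get("url") for link in fff.findall("link")}

    # Each <low> carries the index slice (dictid, sliceid) and, optionally,
    # a urlid pointing back to one of the <link> elements.
    words = []
    for low in fff.iter("low"):
        words.append({
            "text": (low.text or "").strip(),
            "dictid": low.get("dictid"),
            "sliceid": low.get("sliceid"),
            "url": links.get(low.get("urlid")),  # None when urlid is absent
        })

    # Document-wide metadata, keyed by the "key" attribute.
    meta = {m.get("key"): m.text for m in fff.findall("meta")}
    return {"links": links, "words": words, "meta": meta}


result = summarize_fff(FFF_SAMPLE)
for word in result["words"]:
    print(word["sliceid"], word["text"], word["url"])
```

The title paragraph has no urlid, so its url entry is None, while the two marginalLink paragraphs resolve to index.html and info.html respectively.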

To further extend the expressiveness of the FFF, Free Metadata and Template Metadata can be used, as described in the following two sections.

From the Control Panel it is possible to display the FFF of any document using the FFF button of the Browsing tab of the Archive configuration panel, as described in the "Showing a document from Archive" section.

On-line Metadata Injection: The Rendering Plugins Pipeline

Every fetched document is processed online by the module pipeline of the searchbox rendering component. Each module in the pipeline analyzes the FFF output by the previous module and can add further metadata to it as a result of its own processing of the document. This action is referred to as On-line Metadata Injection because it is performed at fetching time on all documents coming from a Source, according to the pipeline configuration associated with that Source. At rendering time the user has no control over what happens to the document: all plugged modules associated with a Source are invoked in sequence on the current FFF.
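
The chaining behaviour described above can be pictured with a short Python sketch. It is only a conceptual model, not the searchbox plugin API: the functions add_meta, word_count_plugin, length_class_plugin and run_pipeline, the metadata keys and the 100-word threshold are all hypothetical, and the <fff> fragment is simplified (it omits the <link> elements and the dictid and sliceid attributes of <meta>). Each plugin receives the FFF produced by the previous one and injects additional metadata, so a later plugin can rely on what an earlier one added.

```python
import xml.etree.ElementTree as ET


def add_meta(fff, key, value):
    """Append a simplified document-wide <meta> element to the <fff> element."""
    meta = ET.SubElement(fff, "meta", {"type": "documentwide", "key": key})
    meta.text = value


def word_count_plugin(fff):
    # Hypothetical plugin: counts the words found in all <low> elements.
    count = sum(len((low.text or "").split()) for low in fff.iter("low"))
    add_meta(fff, "wordcount", str(count))
    return fff


def length_class_plugin(fff):
    # Hypothetical plugin: reads the metadata injected by word_count_plugin,
    # so it must come later in the pipeline.
    count = int(fff.find("meta[@key='wordcount']").text)
    add_meta(fff, "lengthclass", "short" if count < 100 else "long")
    return fff


def run_pipeline(fff, plugins):
    """Invoke every plugin in sequence on the current FFF."""
    for plugin in plugins:
        fff = plugin(fff)
    return fff


fff = ET.fromstring(
    "<fff><contents><p><s><low dictid='10' sliceid='centralNorm'>"
    "Choose your path</low></s></p></contents></fff>"
)
run_pipeline(fff, [word_count_plugin, length_class_plugin])
injected = {m.get("key"): m.text for m in fff.findall("meta")}
print(injected)  # {'wordcount': '3', 'lengthclass': 'short'}
```

Swapping the two plugins in the list would make length_class_plugin fail, because the metadata it depends on would not have been injected yet; this is why the order of modules in a pipeline matters.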

Configuration of a Rendering Plugin

A Rendering plugin is a software module that is part of the rendering pipeline owned by every Source. Modules must be configured before they can be used in a pipeline. The Plugins Tab in the left pane of the Control Panel shows a list of all available (already configured) rendering modules. Such modules can be added to the rendering pipeline of any Source without further configuration.

Figure 9.1. The Plugins Tab

Figure 9.2. The default plugins

Note

The list of default plugins may differ from the one in the figure because it is specific to the exact searchbox version you are using.

Clicking the New... button opens a popup window that offers two ways of creating a new plugin.

The first method consists of creating a new plugin with an initially empty configuration. The new plugin must have a type chosen from those available in the related dropdown, as shown in the following picture.

Figure 9.3. New Plugin creation window

The second method consists of creating a new plugin by inheriting from one of those already configured. In this case the initial configuration is the same as that of the plugin it inherits from. This procedure is shown in the following picture.

Figure 9.4. Inherit Plugin creation window

Activation of a Rendering Plugin

A rendering plugin is activated in the rendering pipeline associated with a Source. The list of available, already configured plugins is shown at the bottom of the Plugins Tab of a Source.

Figure 9.5. Available Rendering plugin

Plugins in the pipeline can also be reordered using the Up and Down buttons.