This section provides more detailed information about the searchbox rendering process.
The rendering process converts the original fetched document into an internal XML representation that contains all features extracted from the original document, modeled as metadata. In this context, a feature is any characteristic that can be directly extracted or inferred from the analysis of the document. The list of words that make up the document is a feature, and so is the layout of the document. More advanced features include, for instance, the semantics associated with portions of the document. The extraction of such features is delegated to specific software agents embedded in searchbox through the plugin system.
The FFF format, described in the following section and in more detail in the Programmer's Reference Guide, is the internal XML document format used by searchbox.
The internal document representation of searchbox 2.2 is based on FFF version 1.0. The minimal basic set of features considered by the FFF 1.0 document format is:
Hyperlinks
Paragraphs
List of words
Metadata
The first section of an FFF lists the hyperlinks contained in the document. The attributes of the <link> tag are urlid, url and score. The urlid is an identifier of the hyperlink assigned by searchbox, the url is the complete URL of the link, and score is a real number between 0 and 1 representing a score value calculated by default by the rendering process.
The paragraph is the main text unit considered by the FFF. It represents a block of text extracted from the original document, taking its layout elements into account. In the FFF a paragraph is delimited by the <p>...</p> tag.
Each paragraph is characterized by a list of words delimited by the <low>...</low> tag. The <low> tag has three possible attributes: dictid, sliceid and urlid. The first two attributes identify the slice of the index where the list of words must be stored, and the last one is a pointer to a hyperlink defined earlier in a <link> tag.
Here is an example of an FFF created from an HTML page of the focuseek web site:
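The structure described so far (links first, then paragraphs holding lists of words that may point back to a link) can be read with a few lines of Python. This is only a minimal sketch using the standard xml.etree.ElementTree module; the sample document, its URLs and the read_fff helper are hypothetical and are not part of searchbox.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened FFF document used only for illustration.
SAMPLE_FFF = """<?xml version="1.0"?>
<document id="doc-1" url="http://www.example.com/" timestamp="1119275665" score="1.0">
  <fff>
    <link urlid="0" url="http://www.example.com/index.html" score="0.5"/>
    <contents>
      <p><s><low dictid="15" sliceid="title">Example title</low></s></p>
      <p><s><low dictid="7" sliceid="marginalLink" urlid="0">Home</low></s></p>
    </contents>
  </fff>
</document>
"""


def read_fff(xml_text):
    """Return (links, words) extracted from an FFF document string."""
    root = ET.fromstring(xml_text)

    # Map each urlid to the complete URL declared in the <link> section.
    links = {link.get("urlid"): link.get("url") for link in root.iter("link")}

    words = []
    for p_index, p in enumerate(root.iter("p")):
        for low in p.iter("low"):
            words.append({
                "paragraph": p_index,
                "dictid": low.get("dictid"),
                "sliceid": low.get("sliceid"),
                # urlid is optional; resolve it to a URL when present.
                "url": links.get(low.get("urlid")),
                "text": (low.text or "").strip(),
            })
    return links, words


if __name__ == "__main__":
    links, words = read_fff(SAMPLE_FFF)
    for word in words:
        print(word)
```

Resolving urlid through a dictionary keeps the link section as the single place where URLs are stored, which matches the pointer role that urlid plays in the format.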
<?xml version="1.0"?>
<document id="0a.e39e5e64b03ce230d138a51e23a39186" url="http://www.focuseek.com/" timestamp="1119275665" score="1.0">
  <fff>
    <link urlid="0" url="http://www.focuseek.com/index.html" score="0.000123"/>
    <link urlid="1" url="http://www.focuseek.com/info.html" score="0.000123"/>
    <link urlid="2" url="http://www.focuseek.com/buy.html" score="0.000123"/>
    <link urlid="3" url="http://www.focuseek.com/documentation.html" score="0.000123"/>
    <link urlid="4" url="http://www.focuseek.com/downloads.html" score="0.000123"/>
    <link urlid="5" url="http://www.focuseek.com/contacts.html" score="0.000123"/>
    <link urlid="6" url="http://www.focuseek.com/webapps.html" score="0.000123"/>
    <link urlid="7" url="http://www.focuseek.com/enterpriseapps.html" score="0.000123"/>
    <link urlid="8" url="http://www.focuseek.com/scalableapps.html" score="0.000123"/>
    <contents>
      <p><s><low dictid="15" sliceid="title">focuseek</low></s></p>
      <p><s><low dictid="7" sliceid="marginalLink" urlid="0">focuseek</low></s></p>
      <p><s><low dictid="7" sliceid="marginalLink" urlid="0">Home</low></s></p>
      <p><s><low dictid="7" sliceid="marginalLink" urlid="1">Info</low></s></p>
      <p><s><low dictid="7" sliceid="marginalLink" urlid="2">Buy</low></s></p>
      <p><s><low dictid="7" sliceid="marginalLink" urlid="3">Documentation</low></s></p>
      <p><s><low dictid="7" sliceid="marginalLink" urlid="4">Downloads</low></s></p>
      <p><s><low dictid="7" sliceid="marginalLink" urlid="5">Contacts</low></s></p>
      <p><s><low dictid="10" sliceid="centralNorm">searchbox 2 With searchbox a new family of search products is born.</low></s><s><low dictid="10" sliceid="cen
      <p><s><low dictid="14" sliceid="centralHeader">Choose your path</low></s></p>
      <p><s><low dictid="10" sliceid="centralNorm">Are you planning a large-scale intranet search facility?</low></s><s><low dictid="10" sliceid="centralNorm">O
      <p><s><low dictid="12" sliceid="centralLink" urlid="6">Web applications</low></s></p>
      <p><s><low dictid="5" sliceid="marginalNorm">Aggregate and publish Web content in public or corporate portals.</low></s></p>
      <p><s><low dictid="7" sliceid="marginalLink" urlid="7">Enterprise applications</low></s></p>
      <p><s><low dictid="5" sliceid="marginalNorm">Search and monitoring technology for small and medium businesses.</low></s></p>
      <p><s><low dictid="7" sliceid="marginalLink" urlid="8">Scalable applications</low></s></p>
      <p><s><low dictid="10" sliceid="marginalNorm">GRID technology to handle hundreds millions documents and thousands users.</low></s></p>
      <p><s><low dictid="5" sliceid="marginalNorm"> © 2005 focuseek</low></s></p>
    </contents>
    <meta type="documentwide" dictid="2" sliceid="invisible" key="reliable">yes</meta>
    <meta type="documentwide" dictid="2" sliceid="invisible" key="page">home</meta>
  </fff>
</document>
The complete DTD of the FFF 1.0 format follows. Please see the Programmer's Reference Manual for more detailed information.
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT contents (p+)>
<!ELEMENT document (fff)>
<!ATTLIST document
id NMTOKEN #REQUIRED
score NMTOKEN #REQUIRED
timestamp NMTOKEN #REQUIRED
url CDATA #REQUIRED
>
<!ELEMENT fff (link+,contents)>
<!ELEMENT link EMPTY>
<!ATTLIST link
score NMTOKEN #REQUIRED
url CDATA #REQUIRED
urlid NMTOKEN #REQUIRED
>
<!ELEMENT low (#PCDATA|p)*>
<!ATTLIST low
dictid (1|2|3|4|5|6|7|9|10|11|12|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|39|31) #REQUIRED
sliceid CDATA #REQUIRED
urlid NMTOKEN #IMPLIED
>
<!ELEMENT p (s+)>
<!ELEMENT s (low)>
In order to further expand the FFF expressivity, Free Metadata and Template Metadata can be used as described in the following two sections.
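Some of the constraints declared in the DTD can be checked by hand when a full validating parser is not at hand. The following Python sketch is not DTD validation (the standard xml.etree.ElementTree module does not validate against DTDs); it only tests the required attributes and the p and s content rules listed in the DTD. The check_fff function and the REQUIRED_ATTRS table are hypothetical helpers written for this illustration.

```python
import xml.etree.ElementTree as ET

# Required attributes per element, copied from the ATTLIST declarations
# of the FFF 1.0 DTD (urlid on <low> is #IMPLIED, so it is not listed).
REQUIRED_ATTRS = {
    "document": ("id", "score", "timestamp", "url"),
    "link": ("score", "url", "urlid"),
    "low": ("dictid", "sliceid"),
}


def check_fff(xml_text):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    root = ET.fromstring(xml_text)

    for elem in root.iter():
        for attr in REQUIRED_ATTRS.get(elem.tag, ()):
            if attr not in elem.attrib:
                problems.append(
                    f"<{elem.tag}> is missing required attribute '{attr}'"
                )

    # <!ELEMENT p (s+)>: every paragraph needs at least one <s> child.
    for p in root.iter("p"):
        if p.find("s") is None:
            problems.append("<p> has no <s> child")

    # <!ELEMENT s (low)>: every <s> holds exactly one <low> child.
    for s in root.iter("s"):
        if len(s.findall("low")) != 1:
            problems.append("<s> must contain exactly one <low>")

    return problems
```

Calling check_fff on a document string returns human-readable messages, which is convenient when inspecting an FFF produced by a custom module.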
From the Control Panel it is possible to display the FFF of any document using the button in the Browsing tab of the Archive configuration panel, as described in the "Showing a document from Archive" section.
Every fetched document is processed on-line by the module pipeline of the searchbox rendering component. Each module of the pipeline analyzes the FFF output by the previous module and can add further metadata to it as the result of its own processing of the document. This action is referred to as On-line Metadata Injection because it is performed at fetching time for all documents coming from a Source, depending on the pipeline configuration associated with that Source. At rendering time the user has no control over what happens to the document: all plugged modules associated with a Source are invoked in sequence on the current FFF.
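The idea of a pipeline in which each module receives the FFF produced by the previous one and may inject further metadata can be sketched as follows. This is a conceptual illustration in Python, not the searchbox plugin API: the module functions, the add_meta helper and the wordcount and linkcount keys are all hypothetical.

```python
import xml.etree.ElementTree as ET


def add_meta(fff, key, value):
    """Append a document-wide <meta> element to the <fff> node."""
    meta = ET.SubElement(fff, "meta", {"type": "documentwide", "key": key})
    meta.text = value
    return meta


def word_count_module(fff):
    """Hypothetical module: inject the number of words found in <low> tags."""
    total = sum(len((low.text or "").split()) for low in fff.iter("low"))
    add_meta(fff, "wordcount", str(total))


def link_count_module(fff):
    """Hypothetical module: inject the number of <link> entries."""
    add_meta(fff, "linkcount", str(len(fff.findall("link"))))


def run_pipeline(fff, modules):
    """Invoke every module in sequence on the current FFF."""
    for module in modules:
        module(fff)
    return fff


if __name__ == "__main__":
    fff = ET.fromstring(
        '<fff><link urlid="0" url="http://www.example.com/" score="0.1"/>'
        '<contents><p><s><low dictid="10" sliceid="centralNorm">'
        "hello big world</low></s></p></contents></fff>"
    )
    run_pipeline(fff, [word_count_module, link_count_module])
    print(ET.tostring(fff, encoding="unicode"))
```

Because every module works on the same FFF tree, the order of the list passed to run_pipeline determines what each module can see from the modules before it.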
The Rendering plugin is the software module that forms part of the rendering pipeline owned by every Source. Modules must be configured before they can be used in the pipeline. The Plugins Tab of the left pane in the Control Panel shows a list of all available (already configured) rendering modules. Such modules can be added to the rendering pipeline of any Source without further configuration.
The list of default plugins may differ from that in the figure because it is specific to the exact searchbox version you are using.
Clicking the button opens a popup window showing two ways of creating a new plugin.
The first method consists of creating a new plugin with an initially empty configuration. The new plugin must have a type chosen from those available in the related dropdown, as shown in the following picture.
Another way to do the same thing is to create a new plugin by inheriting from one of those already configured. In this case the basic configuration will be the same as that of the plugin it is inherited from. This procedure is shown in the following picture.
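The difference between the two creation methods can be pictured with a small data model. The Python sketch below is purely illustrative and does not reflect how searchbox stores plugins internally: the Plugin class, the type names and the configuration key are hypothetical, and it assumes that an inherited plugin keeps the type of its parent.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class Plugin:
    name: str
    plugin_type: str
    config: dict = field(default_factory=dict)


def create_empty(name, plugin_type, available_types):
    """First method: a new plugin of a chosen type with an empty configuration."""
    if plugin_type not in available_types:
        raise ValueError(f"unknown plugin type: {plugin_type}")
    return Plugin(name, plugin_type)


def create_inherited(name, parent):
    """Second method: start from a copy of an already configured plugin."""
    # Assumption: the type is inherited together with the configuration.
    return Plugin(name, parent.plugin_type, deepcopy(parent.config))


if __name__ == "__main__":
    types = {"typeA", "typeB"}  # hypothetical type names
    base = create_empty("base", "typeA", types)
    base.config["depth"] = 2  # hypothetical configuration key
    child = create_inherited("child", base)
    child.config["depth"] = 5  # changing the child leaves the parent untouched
    print(base, child)
```

Copying the parent configuration rather than sharing it means that later changes to the new plugin do not alter the plugin it was inherited from.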
The rendering plugin is activated in the rendering pipeline associated with a Source. At the bottom of the Plugins Tab of a Source, the list of available and already configured plugins is shown.
Plugins in the pipeline can also be reordered using the corresponding buttons.