searchbox is a complete toolbox that lets you set up a complete content gathering system in very little time. A content gathering system differs from a standard search engine mainly because it can gather information from many different source types, not only from the Web. This makes a content gathering system useful for enterprise search.
searchbox also adds proprietary features for metadata management. Through its Rendering module, searchbox can automatically associate metadata with individual documents and use them as retrieval handles.
searchbox also provides an interesting set of publishing options. In addition to the standard query language, it can send multi-format notifications of newly gathered content by creating specific channels.
The following picture shows the three basic actions involved in using searchbox and the components and concepts that take part in each.
Gathering (Take what you need). searchbox uses a few basic concepts to manage gathering from local and remote information sources. This configuration activity is usually done once, when a new Source is added to the gathering system or when its behaviour changes.
Rendering (Exploit the Semantic). This phase depends heavily on the feature extractors installed in your searchbox instance. In general terms, searchbox performs the rendering action automatically every time a new document is fetched. At rendering time all feature extractors are applied to the document, and the information they produce is injected into the document itself as structured metadata.
Publishing (It's time to find). searchbox is an "always on" system: once a new document is fetched, it is immediately rendered and published to final users. searchbox provides both PULL and PUSH publishing features through its query and notification interfaces. All publishing services are available for as long as searchbox is running.
From a technological point of view, searchbox is a very sophisticated piece of software, but any user can configure it easily through a few powerful basic concepts.
A Source is a place where information resides. A source can be thematic, focused on a specific topic, or completely generic, with information on many topics at once. If you need sports news, a sports newspaper is certainly a good candidate Source, but if you need information about health the best choice is probably a medical magazine. A library is also a good information source, even though it is usually not digital; the hard disk of your personal computer and a Compact Disc, on the other hand, are information sources in digital format. A source is called a Digital Source when a computer can acquire information from it using a specific protocol.
searchbox can acquire information from many types of digital sources. It only needs to know a few things about the source itself, such as its physical address, the access protocol and, if needed, the user identification credentials.
searchbox can manage many types of Sources, enough to cover a very wide range of applications.
The searchbox (SB) "range of action" starts from the server where SB is installed and extends to every reachable resource. The first layer contains local resources such as email folders and local disks. Depending on the access privileges granted to the system, SB can configure all these local resources as Sources.
Local network resources can be configured as Sources too. In this case as well, suitable access privileges must be granted.
The most "remote" resources reachable by SB are those on the Internet. SB can natively access Web sites, newsgroup servers, Web Services, RSS streams and so on, so the concept of Source can be generalized as "any physically reachable resource known by its URL".
The rate of new information we can gather from an information source defines how dynamic that source is. A novel is a static information source; an encyclopedia is more dynamic because of its annual update; a newspaper is very dynamic because of its daily editions. The concept of "edition" does not apply well to digital sources, where there is often a continuous stream of news: the electronic versions of newspapers available on the Web are updated in real time, as soon as news reaches the editorial office.
searchbox can record the history of all the information produced by a Source. For every Source we can create a multi-level cache where each level represents a time-slice of the Source. All the search capabilities of searchbox are effective on Archives of this type.
Figure 5.3. The three dimensional structure of a searchbox Digital Source: the (x,y) space and its time evolution (t)

Thanks to its multi-level caching system, searchbox is an ideal tool for implementing "mining" and dynamic content monitoring applications.
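The time-slice idea can be illustrated with a small sketch. It is not searchbox code: the `Snapshot` and `SourceHistory` names, and the choice of a plain URL-to-text mapping for each cache level, are assumptions made only to show how a history of slices supports change monitoring.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Snapshot:
    """One cache level: what the Source looked like at a given time."""
    taken_at: datetime
    documents: dict  # URL -> extracted text


@dataclass
class SourceHistory:
    """Hypothetical stand-in for a multi-level cache of one Source."""
    snapshots: list = field(default_factory=list)

    def record(self, taken_at, documents):
        # Store a copy so later changes to the caller's dict do not leak in.
        self.snapshots.append(Snapshot(taken_at, dict(documents)))
        self.snapshots.sort(key=lambda s: s.taken_at)

    def changed_between(self, older_index, newer_index):
        """Return URLs that are new or whose text differs between two levels."""
        old = self.snapshots[older_index].documents
        new = self.snapshots[newer_index].documents
        return sorted(url for url, text in new.items() if old.get(url) != text)


history = SourceHistory()
history.record(datetime(2024, 1, 1), {"/home": "hello", "/news": "first edition"})
history.record(
    datetime(2024, 1, 2),
    {"/home": "hello", "/news": "second edition", "/sport": "match report"},
)
changed = history.changed_between(0, 1)  # ['/news', '/sport']
```

Comparing two levels in this way is the kind of operation a content monitoring application would build on.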
A Collection aggregates many information Sources (and their Archives) into a single access point. The Collection supports a standard query language that combines keywords and attributes with the standard AND, OR, NOT and NEAR operators, as in many other Internet search engines.
Using a Collection object, it is possible to create a specialized index of contents coming only from selected information Sources.
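As a rough illustration of what the four operators mean, the sketch below evaluates them over a tiny in-memory positional index. It does not reproduce the searchbox query syntax or its indexer: the function names, the whitespace tokenizer and the `distance` parameter of NEAR are all assumptions.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def term(index, word):
    """Documents containing the word."""
    return set(index.get(word.lower(), {}))


def near(index, left, right, distance=3):
    """Documents where the two words occur within `distance` positions."""
    left_postings = index.get(left.lower(), {})
    right_postings = index.get(right.lower(), {})
    hits = set()
    for doc_id in left_postings.keys() & right_postings.keys():
        if any(abs(a - b) <= distance
               for a in left_postings[doc_id]
               for b in right_postings[doc_id]):
            hits.add(doc_id)
    return hits


docs = {
    "a": "enterprise search engine for local disks",
    "b": "web search for sport news",
    "c": "sport engine news",
}
index = build_index(docs)
all_docs = set(docs)

search_and_not_sport = term(index, "search") & (all_docs - term(index, "sport"))
engine_or_news = term(index, "engine") | term(index, "news")
sport_near_news = near(index, "sport", "news", distance=1)
```

AND, OR and NOT map directly onto set intersection, union and difference; NEAR is the only operator that needs word positions.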
This last basic concept, the Watch, lets searchbox be used both as a powerful content monitoring tool and as a news channel.
The Watch is a view on a Collection defined by a persistent query. Every time the Watch is shown, it produces the list of the newest information contained in the corresponding Archives, filtered by the persistent query and ranked by timestamp. The output of the Watch can be regarded as a "press review" of the most recent interesting news coming from a set of information Sources.
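A minimal sketch of this behaviour follows, assuming the persistent query can be modelled as a predicate over documents. The `Document` and `Watch` classes, the `limit` parameter and the `fetched_at` field are hypothetical names chosen for the example, not part of the searchbox API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Document:
    url: str
    text: str
    fetched_at: datetime


class Watch:
    """Hypothetical Watch: a stored filter re-applied every time it is shown."""

    def __init__(self, persistent_query, limit=10):
        self.persistent_query = persistent_query  # callable: Document -> bool
        self.limit = limit

    def show(self, collection):
        # Filter by the persistent query, then rank newest first.
        matching = [d for d in collection if self.persistent_query(d)]
        matching.sort(key=lambda d: d.fetched_at, reverse=True)
        return matching[: self.limit]


collection = [
    Document("/a", "local sport results", datetime(2024, 1, 1, 9, 0)),
    Document("/b", "weather forecast", datetime(2024, 1, 1, 10, 0)),
    Document("/c", "sport transfer news", datetime(2024, 1, 1, 11, 0)),
]
sport_watch = Watch(lambda d: "sport" in d.text)
press_review = [d.url for d in sport_watch.show(collection)]  # ['/c', '/a']
```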
The following picture shows how the above concepts are connected to one another.
They are organized in three different groups:
Gathering group. The source is the central gathering concept. The goal of a source is to group a number of seeds and to configure a suitable access protocol. A seed can be a database table, a Web page, a complete Web site, a specific portion of the Web or a fully custom data repository. The source natively supports access to seeds protected by digital certificates, passwords, cookies, etc. A seed can be shared by many sources.
Indexing group. An archive represents an index and a repository of contents coming from a given source. Multiple archives can be connected to a single source, since every archive can have different configuration rules for its creation (e.g. caching or not, different access credentials for different users, etc.). Finally, in order to create indexes from different kinds of sources, many archives can be grouped together to form a collection. The collection is a way to aggregate sources that are heterogeneous in terms of seeds but homogeneous in terms of topic (e.g. all the Italian newspapers). Both archives and collections are queryable objects.
Monitoring group. searchbox can also be used as a monitoring tool. Watches contain a set of static filters on the information stream coming from a collection. A watch can be used to implement a customized view on any information stream originated from a group of dynamic sources through the corresponding collections. Watches support subscriptions from any client application that needs to be alerted as soon as a specific condition is matched.
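The relationships described in the three groups can be summarized as a small data model. This is only a sketch under assumptions: the class and field names below are invented for illustration and do not correspond to the actual searchbox configuration objects.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Seed:
    url: str  # a Web page, a site, a database table, ...


@dataclass
class Source:
    name: str
    protocol: str
    seeds: list = field(default_factory=list)


@dataclass
class Archive:
    name: str
    source: Source
    caching: bool = False  # each archive has its own creation rules


@dataclass
class Collection:
    name: str
    archives: list = field(default_factory=list)

    def sources(self):
        """Distinct sources feeding this collection, in first-seen order."""
        seen = []
        for archive in self.archives:
            if archive.source not in seen:
                seen.append(archive.source)
        return seen


front_page = Seed("https://example.com/")
daily = Source("daily", "http", [front_page])
weekly = Source("weekly", "http", [front_page])  # the same seed shared by two sources

newspapers = Collection("newspapers", [
    Archive("daily-live", daily),
    Archive("daily-cached", daily, caching=True),  # two archives on one source
    Archive("weekly-live", weekly),
])
source_names = [s.name for s in newspapers.sources()]  # ['daily', 'weekly']
```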
Gathering data from original sources is one of the main problems in digital content integration and delivery. A typical scenario is having to gather information from many heterogeneous digital sources that are also geographically distributed. The owners of such digital sources are focused on their original mission of content production and usually do not provide a standard way for other applications to access their archives. There are many reasons for this, but it is easy to understand that information is a content provider's main asset and that the provider wants strict control over how it is delivered. As a result, in many cases content providers do not really care whether users want to access their information from other applications through standard protocols and formats. This is far from ideal for a user who has to deal with many content providers, because he or she is forced to set up and maintain a custom communication channel with each of them. Such channels are characterized by custom user interfaces and are often very hard to integrate into other applications. A possible solution to this kind of problem is a custom variation of the approach currently used by Web search engines, combined with Web Service technology.
Web search engines cannot influence in any way how web sites publish their information, so an engine that wants to build an index of the content provided by a site must access that site on its own. The method search engines use to accomplish this task is called “crawling” or “spidering”. A web crawler is a software agent that simulates a real user accessing a web site and reads all the information contained in it. To succeed, a crawler needs a toolbox of “adapters” able to match every access protocol and document format available on the web. Not many years after its birth, the WWW began to support protocols other than the original HTTP and document formats other than HTML. Formats like PDF, DOC and Flash, and protocols like NNTP, FTP and ODBC (some of which actually predate the HTML over HTTP web standard), forced Web crawlers to adapt to the new situation.
The basic assumption of a typical Web crawler is that any information source must be treated as a “black box”, with no way to contact the webmaster and ask him or her to adapt content for a specific usage. From the Web source's point of view, a Web crawler is like any other user that visits the site. This approach is very powerful because it has zero organizational and technical impact on the information sources, and for this reason it has been successfully adopted in the enterprise environment too. In any large company or public administration, aggregating content from different and heterogeneous sources is really hard to accomplish, even when the sources are located within and managed by the organization itself. Exporting data from an existing database means that the organization providing the content, the one using it, or both must obtain the necessary authorizations, write some software and therefore allocate human resources.
All those reasons are serious potential points of failure for any content integration project. In this type of scenario, crawling technology can enormously simplify the integration task, because the crawler acts exactly like any other authorized user, whose access procedures are already defined and accepted by all departments of the company.
An interesting way to visualize the content gathering problem is to imagine that, in order to acquire information, we have to set up a channel connecting the content provider and the users. Using the “search engine” approach already discussed, a possible solution is to create a system that aggregates many different information sources and provides some standard application services to access them. In this way users only need to know the standard application interface provided by the gathering system.
Figure 5.7. The Content Gathering component as bridge between content providers and client application

On the left side of the above picture, the heterogeneous world of content providers is sketched. Different shapes represent the different protocols and formats used to access the content. On the opposite side there is a structured repository that needs to be filled with contents coming from the content providers. The middle component is the content gathering module, which chooses the right adapter to gather information from each content provider and exposes some standard services:
A query service retrieves any piece of information in the repository through a query composed of words or metadata combined with the AND, OR, NOT and NEAR operators typical of any search engine. Indexing is implemented with a fully dynamic indexing service that takes new content into account as soon as it is added to the repository. No index rebuild is needed.
A feed service automatically delivers newly acquired contents through a channel. A very common standard such as RSS can be used for this purpose.
An alerting service generates events to notify that something has changed in the repository. Alerting methods use email messages, Instant Messaging, SMS and Web Service calls.
The above services can be used by a client side component to build any kind of structured object based on the original “raw” information gathered from content providers. Obviously any type of structure provided by the content provider itself will be preserved and indexed too.
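To make the channel idea concrete, here is a minimal sketch that serializes newly acquired items as an RSS 2.0 document using only the Python standard library. The `build_rss` function, the item fields and the example URLs are assumptions; the feeds actually produced by searchbox may carry different elements.

```python
import xml.etree.ElementTree as ET


def build_rss(channel_title, channel_link, items):
    """Return an RSS 2.0 document (as text) listing the given items."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Newly gathered documents"
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
    return ET.tostring(rss, encoding="unicode")


feed = build_rss(
    "Example channel",
    "http://example.com/channel",
    [{"title": "Doc 1", "link": "http://example.com/1"}],
)
```

A client application would poll such a feed and build its own structured objects from the items it has not seen yet.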
From the searchbox point of view, a freshly fetched document is a completely opaque item, a binary object that must be properly treated so that users can later retrieve it by its specific features.
A good representation of a group of fetched documents waiting to be processed by searchbox is shown here.
It is a group of spheres with very smooth surfaces, with no way to distinguish one from another.
The rendering process consists of analysing the content of each document and revealing all of its possible features.
After the rendering process, the spheres of our example have some "handles". Such handles are unique for each document but must belong to a specific kind of feature. Usually the set of features that can be extracted from a document depends on the type of the document itself. A very basic type of feature that can be extracted from text documents is the "list of words" used to implement full-text indexing. For other types of documents, such as pictures and videos, other possible features are "author", "duration", "category" and so on.
By default searchbox implements only some of those feature extractors (e.g. the "list of words" extractor for full-text search), plus a special plug-in system that accepts custom processing modules for specific document formats.
Once our documents have their handles revealed, the searchbox query engine can easily use them to retrieve documents that share one or more characteristics.
The above picture is a visual representation of the handle-based retrieval model performed by searchbox. Each ring is a query and contains a chain of spheres (documents) that share a specific feature (e.g. all documents that contain the word "computer", or all documents that are videos no longer than one minute).
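The handle model can be sketched in a few lines. This is an illustration under assumptions, not the searchbox implementation: documents are plain dictionaries, a handle is a `(feature kind, value)` pair, and the two extractor functions are hypothetical.

```python
from collections import defaultdict


def words_extractor(doc):
    """Yield one ("word", w) handle per distinct word of a text document."""
    for word in set(doc.get("text", "").lower().split()):
        yield ("word", word)


def media_extractor(doc):
    """Yield ("type", ...) and ("duration", ...) handles when present."""
    if "type" in doc:
        yield ("type", doc["type"])
    if "duration" in doc:
        yield ("duration", doc["duration"])


def render(docs, extractors):
    """Apply every extractor to every document; return handle -> doc ids."""
    handles = defaultdict(set)
    for doc_id, doc in docs.items():
        for extractor in extractors:
            for handle in extractor(doc):
                handles[handle].add(doc_id)
    return handles


docs = {
    "d1": {"type": "text", "text": "Buy a new computer"},
    "d2": {"type": "video", "duration": 45},   # seconds
    "d3": {"type": "video", "duration": 300},
}
handles = render(docs, [words_extractor, media_extractor])

# "Ring" 1: all documents that contain the word "computer".
computer_docs = handles[("word", "computer")]

# "Ring" 2: all videos no longer than one minute.
short_videos = handles[("type", "video")] & {
    doc_id
    for (kind, value), ids in handles.items()
    if kind == "duration" and value <= 60
    for doc_id in ids
}
```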
searchbox differs deeply from a traditional relational DBMS in two key aspects:
searchbox does not have an internal structured model of the data it has to store. It builds its internal data structure on the fly, as soon as it "sees" data for the first time. The searchbox administrator does not need to define how data are structured, only how data can be reached by an external software agent. With this approach searchbox can build its own private view of any information source, independently of how data are structured inside the source itself.
Structured information is attached to individual documents, both at fetching time and when offline editors inject specific metadata. This makes searchbox particularly suitable for dynamic document repositories with a high turnover.
searchbox also requires very little administration effort compared with any enterprise-level database, and it can usually be installed in an existing computing environment in minutes.
Even though searchbox can be used in many applications where databases are currently involved, its use should be limited to situations where unstructured information must be gathered and/or managed (e.g. document management). In all other cases a standard database usually works well.
searchbox works differently from other search engines on the market. It can perform a retrieval task using any piece of information that it is able to extract from a digital document.
searchbox is not limited to full-text retrieval: what it can retrieve depends on the rendering agents that are active on a specific information source. Such agents are organized as a processing pipeline that is applied to every fetched document.
The following picture is an overview of the searchbox rendering process.
The rendering process extracts from the original document a set of features that are encoded in an intermediate internal XML format called FFF (Focuseek Flexible Format). Document processing inside the rendering module is defined by a Plug-In chain, where each Plug-In is responsible for extracting a specific set of features to be indexed. If no Plug-In in the chain is active, the rendering module only extracts the text and its paragraph organization. In this case the searchbox indexing module performs a simple full-text indexing of the document.
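The chain behaviour can be sketched as a list of functions applied in order. The FFF format itself is not reproduced here: the rendered document is modelled as a plain dictionary, and both plug-ins are hypothetical examples rather than modules shipped with searchbox.

```python
def extract_text(raw):
    """Fallback step: keep only the text and its paragraph organization."""
    paragraphs = [p.strip() for p in raw.split("\n\n") if p.strip()]
    return {"paragraphs": paragraphs, "features": {}}


def title_plugin(rendered):
    """Hypothetical plug-in: treat the first paragraph as the title."""
    if rendered["paragraphs"]:
        rendered["features"]["title"] = rendered["paragraphs"][0]
    return rendered


def word_count_plugin(rendered):
    """Hypothetical plug-in: add a generated metadata value."""
    rendered["features"]["word_count"] = sum(
        len(p.split()) for p in rendered["paragraphs"]
    )
    return rendered


def render(raw, chain=()):
    """Run the base text extraction, then every plug-in of the chain in order."""
    rendered = extract_text(raw)
    for plugin in chain:
        rendered = plugin(rendered)
    return rendered


raw = "Hello world\n\nSecond paragraph here"
plain = render(raw)                                   # no active plug-in
rich = render(raw, [title_plugin, word_count_plugin])
```

With an empty chain only the paragraphs are available, which corresponds to the plain full-text case; each plug-in added to the chain contributes further features.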
The searchbox indexer can index all the features that plug-ins extract from documents and organizes them in a proprietary structure closely tied to the rendering process described above. The basic idea is that the rendering module understands the layout structure of a page, so it can assign a specific role to each portion of text. Typical roles for rendered text are "title", "central text" and "marginal note". A specific weight is assigned to each role, so that the ranking of query results depends on the roles that the keywords have in the retrieved pages. A typical important role is the title, while a less important one is the footnote. searchbox models this by distributing the content of a page over different layers, each corresponding to a specific role and indexed separately.
searchbox supports this layered architecture with 31 different slices. Slices 1 to 15 are defined by default, both in role and in associated weight, while the other 16 are customizable. Using custom rendering Plug-Ins, such slices can be populated with information extracted from the document or with generated metadata. At query time it is possible to specify which slices to use and, if needed, to modify their default weights.
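The effect of role weights on ranking can be shown with a small sketch. The slice numbers, the roles attached to them and the weight values below are assumptions chosen for the example; the source does not list the real defaults, and the actual searchbox ranking function is certainly more elaborate than a weighted occurrence count.

```python
# Assumed mapping for this example only: 1 = title, 2 = central text, 3 = footnote.
DEFAULT_WEIGHTS = {1: 3.0, 2: 1.0, 3: 0.5}


def score(document_slices, keyword, weights=DEFAULT_WEIGHTS, use_slices=None):
    """Sum keyword occurrences per slice, each multiplied by the slice weight."""
    keyword = keyword.lower()
    total = 0.0
    for slice_id, text in document_slices.items():
        if use_slices is not None and slice_id not in use_slices:
            continue  # slice excluded at query time
        occurrences = text.lower().split().count(keyword)
        total += weights.get(slice_id, 0.0) * occurrences
    return total


page_a = {1: "Computer prices", 2: "Prices fell this year", 3: "Source: survey"}
page_b = {1: "Market report", 2: "The computer market grew", 3: "computer survey"}

default_a = score(page_a, "computer")                 # keyword in the title
default_b = score(page_b, "computer")                 # keyword in text and footnote
body_only_a = score(page_a, "computer", use_slices={2, 3})
reweighted_b = score(page_b, "computer", weights={1: 1.0, 2: 1.0, 3: 4.0})
```

With the default weights the page with the keyword in its title ranks first; restricting the slices or overriding the weights at query time changes the order.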
A very powerful and innovative feature proposed by searchbox is the possibility to notify someone or something of the presence of any new information entered in the system. This feature is especially useful when the archives are extremely dynamic and there is a big turnover of information.
The component that implements this feature is the Watch. For every Watch a list of notificators can be configured. A notificator is an endpoint of an external service that is expected to listen for searchbox messages about new documents satisfying the current Watch configuration.
Possible notificator types are:
RSS stream. Generated by default for all Watch results.
EMAIL message. News items are contained in a standard email message.
IM message. Like the email message, but sent as an instant message using any of the major instant messaging protocols.
SOAP call. All new documents are passed to another Web Service using a specific SOAP call.
With this notification feature it is possible to implement a tiny but effective workflow system with searchbox.
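The dispatch logic can be sketched as follows. Nothing here is the searchbox API: `NotifyingWatch` is a hypothetical class, and the notificators are modelled as plain callables that append to in-memory lists, standing in for real email, IM or SOAP endpoints.

```python
class NotifyingWatch:
    """Hypothetical Watch that pushes each new matching document once."""

    def __init__(self, matches):
        self.matches = matches      # callable: document dict -> bool
        self.notificators = []      # callables receiving a list of documents
        self._seen = set()          # URLs already notified

    def subscribe(self, notificator):
        self.notificators.append(notificator)

    def on_documents(self, documents):
        """Called after a fetch; notifies every subscriber of fresh matches."""
        fresh = [
            d for d in documents
            if d["url"] not in self._seen and self.matches(d)
        ]
        self._seen.update(d["url"] for d in fresh)
        if fresh:
            for notify in self.notificators:
                notify(fresh)
        return fresh


outbox = []      # stand-in for an email notificator
soap_calls = []  # stand-in for a SOAP notificator

watch = NotifyingWatch(lambda d: "sport" in d["text"])
watch.subscribe(lambda docs: outbox.append([d["url"] for d in docs]))
watch.subscribe(lambda docs: soap_calls.append(len(docs)))

first = watch.on_documents([
    {"url": "/a", "text": "sport news"},
    {"url": "/b", "text": "weather"},
])
second = watch.on_documents([{"url": "/a", "text": "sport news"}])  # already sent
```

Chaining a notificator to another service that acts on the received documents is what turns this mechanism into a simple workflow.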
searchbox supports the most common document formats: HTML, Microsoft Word, PDF, RTF, plain text and Internet mail messages (RFC822). Once fetched, all documents are transformed into an internal XML format (FFF) with UTF-8 representation. It is important to note that although searchbox uses an internal XML format for documents, it is not an XML database and does not support any XML query standard.
A single fetched document must not exceed 16 MB. If it does, the extracted text is truncated for some document formats (e.g. HTML) or null for others (e.g. MS Word).
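The size rule can be expressed as a short guard. This is a sketch under assumptions: the text only gives HTML and MS Word as examples, so the set of truncating formats, the point at which truncation happens and the treatment of every other format are guesses made for illustration.

```python
MAX_DOCUMENT_BYTES = 16 * 1024 * 1024  # 16 MB

# Assumption: only HTML is known from the text to be truncated.
TRUNCATING_TYPES = {"text/html"}


def usable_bytes(raw, mime_type, limit=MAX_DOCUMENT_BYTES):
    """Return the bytes to extract text from, or None if nothing is usable."""
    if len(raw) <= limit:
        return raw
    if mime_type in TRUNCATING_TYPES:
        return raw[:limit]  # oversized but truncatable
    return None             # oversized, e.g. MS Word: no text at all


# A small limit keeps the example cheap to run.
html_result = usable_bytes(b"x" * 10, "text/html", limit=4)
word_result = usable_bytes(b"x" * 10, "application/msword", limit=4)
small_result = usable_bytes(b"abc", "application/msword", limit=4)
```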
At this moment the supported MIME types are:
| Document type | MIME type | Notes |
|---|---|---|
| ASCII text | text/plain | All words contained in documents are indexed ignoring line endings. A short line is considered to be a paragraph break. |
| HTML | text/html | HTML is supported in all versions up to HTML 4.01, XHTML is supported in all 1.x versions. The HTML parser is generally very robust with respect to malformed or invalid HTML. Images, style sheets, style tags and JavaScript are ignored. Framesets are supported as a source for links to the framed pages. Client-side imagemaps are supported as a source for links. Links generated by JavaScript and JavaScript document location changes are not supported. |
| PDF | application/pdf | PDF documents are supported in all versions up to 1.5 (Acrobat 6). Encrypted PDFs are not supported. Some PDF generators don't emit enough information to extract all the contained text as it appears on page, and PDFs with complex multicolumn layouts might result in text being extracted with a different paragraph or sentence ordering compared to the visual layout. Page boundaries are ignored. |
| Microsoft Word (Windows and Mac) | application/msword | Microsoft Word files are supported when generated by Word 97, Word 98 (Mac), Word 2000, Word XP and Word v.X (Mac). Files generated by versions of Word earlier than Word 97 or Word 98 (Mac) are currently not supported. Page boundaries are ignored. Embedded documents are currently not supported. |
| Rich Text Format (RTF) | text/rtf | RTF files are supported in all versions up to RTF 1.6. Page boundaries are ignored. |
| Email message formatted as RFC822 | message/rfc822 | Message files are supported in their raw form, as they are transmitted among mail servers, returned by POP3 servers, stored in the Unix maildir format and in Microsoft Outlook .eml files. The Unix mbox format, the Netscape mail format and Qualcomm Eudora's mail format are composed of a sequence of RFC822 messages, and can be imported after splitting the mailbox into individual messages. Message text is imported entirely, and any attached message or document in one of the supported formats is imported recursively. The message text and all the recognized attachments are indexed as a single document. |
For all types of documents searchbox supports the following charsets:
windows-1250
windows-1251
windows-1252
windows-1253
windows-1254
windows-1255
windows-1256
windows-1257
windows-1258
latin-1
us-ascii
iso-8859-1
iso-8859-2
iso-8859-3
iso-8859-4
iso-8859-5
iso-8859-6
iso-8859-7
iso-8859-8
iso-8859-9
iso-8859-10
iso-8859-11
iso-8859-13
iso-8859-14
iso-8859-15
iso-8859-16
cp437
cp737
cp775
cp850
cp852
cp855
cp857
cp860
cp861
cp862
cp863
cp864
cp865
cp866
cp869
cp874
utf-8
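Since every fetched document ends up in a UTF-8 representation, the conversion from one of the charsets above can be sketched with the Python standard library. The `to_utf8` helper is a hypothetical name, and this is only an illustration of the transcoding step, not the way searchbox performs it internally.

```python
import codecs


def to_utf8(raw, charset):
    """Decode bytes from the declared charset and re-encode them as UTF-8.

    Raises LookupError if Python does not know the charset name.
    """
    codec = codecs.lookup(charset)  # normalizes aliases such as "latin-1"
    return raw.decode(codec.name).encode("utf-8")


from_windows = to_utf8("caffè".encode("windows-1252"), "windows-1252")
from_iso = to_utf8(b"\xa4", "iso-8859-15")  # 0xA4 is the euro sign in ISO-8859-15
from_dos = to_utf8("é".encode("cp437"), "cp437")
```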