searchbox 2

User's Guide

release 2.2.0


Table of Contents

1. The searchbox documentation
User's Guide
Programmer's Reference Guide
CookBook
2. Introduction
What's inside the "box"
How to Use the Manuals
Inside This Manual
Hardware and Software Requirements
3. Getting Started
Making a Deployment Plan
Multiplatform support
The Upgrading procedure
Upgrade from searchbox 1.x to searchbox 2.2.x
Upgrade from previous searchbox 2.x releases
The Installation procedure
Windows 2000/XP
Requesting the activation key under Windows
Installing the searchbox activation key under Windows
Windows 2003 Server
Linux
Installing searchbox Control Panel
Requesting the activation key under Linux
Installing the searchbox activation key under Linux
Mac OS X
Requesting the activation key under Mac OS X
Installing the searchbox activation key under Mac OS X
Uninstalling searchbox under Mac OS X
Starting and stopping searchbox
Windows
Start
Stop
Status
Linux
Automatic Start and Stop
Start
Stop
Status
Mac OS X
Automatic Start and Stop
Start
Stop
Status
4. Notes for Search Engine beginners
Introducing Search Engine Robots & Spiders
How Robots Follow Links to Find Pages
Breadth-First Crawling
Depth-First Crawling
Other Indexing Issues
Spidering Depth
Server Load
Robots and private data
Password-Protected Pages
Encrypted Data
Communicating With Robots
User Agents in Web Server Logs
Robots.txt
The indexing process
Details on Following Links and Crawling Sites
The Site Map Solution and its Limitations
Frames and Framesets
JavaScript Pages
JavaScript Menus and Lists
Redirecting Links
Other Web Interfaces
Graphic Text
Plug-ins
Acrobat
Java
ActiveX
Non-Text File Formats
Dealing with Dynamic Data
Cookies
Dynamic Web Applications and Black Holes
Detecting Duplicate Pages
Updating Indexes
Update Schedules
Locating changed pages
5. searchbox essentials
Basic concepts
Digital Source
Archive
Collection
Watch
Metadata Template
The Plugin System
How the Pieces stay together
A "Black Box" Approach for Content Gathering
The Search Engine perspective
The bridging brick
The Indexing/Querying Service
The Feeding Service
The Alerting Service
Rendering: revealing document features
Is searchbox a DBMS?
Beyond full-text indexing
Multichannel Syndication
Multiple document formats
Supported charset
6. The Control Panel
Launching the Control Panel
Configuring a new Engine endpoint
Switching between Engines
7. User Management and ACLs
Users
Groups
Root ACL
ACL
8. Gathering
Creating a new Source
Adding a new Seed to a Source
Web site
FTP site
Gopher site
Usenet site
Filesystem
Mailbox
WebDav share
SMB share
ODBC
Other
Configuring the Gathering Depth Limit
Activating robots.txt checking
Gathering "side metadata"
Collecting metadata embedded in HTML documents
Configuring the authentication method
Basic authentication
Cookie authentication
SSL Certificate authentication
Excluding portions of HTML from gathering
Configuration of a Fetching Plugin
Activation of a Fetching Plugin for a Source
Creating a custom gathering filter
Creating a new Archive
Checking status of current gathering activity
Gathering control
Manual reprocessing of documents
Resetting the content of an Archive
Accessing gathering logs
Exporting gathering logs
Configuring Gathering limits
Scheduling automatic Gathering
Configuring the Garbage Collector
Making a query on Archive
Showing a document from Archive
Text FFF view
Metadata FFF view
Raw FFF view
Manually adding/removing documents to/from an Archive
Getting the ID of a document
9. Rendering
Exploiting Metadata
The "Focuseek Flexible Format" (FFF)
On-line Metadata Injection: The Rendering Plugins Pipeline
Configuration of a Rendering Plugin
Activation of a Rendering Plugin
10. Editing
Off-line Metadata Injection: Metadata Templates
11. Publishing
Creating a new Collection
Making a query on a Collection
The query syntax
Words
Metadata
Query operators
Combining queries using parentheses
Some final points on syntax
Creating a new Watch
Setting Watch freshness
Creating a new notificator for Watch
Browsing Watch results
Viewing Watch results in an RSS aggregator
The searchbox Enterprise Search Portal (ESP)
Accessing ESP
Make a query
Group results by different criteria
Sort query results
Set a time window for query results
Expand & Collapse groups
Getting more info
12. Bundled plugins configuration
The SLR plugin
The URL splitter plugin
The static plugin dll
The regular expressions plugin dll
The ODBC plugin dll
The odbc URL scheme
Authentication
Configuration values
The fake mime plugin dll and its pre-packaged plugin configurations
Configuration values
13. Engine administration
Import/Export
Windows
Unixes
Reindex
Windows
Unixes
Reset
Windows
Unixes
Optimize the index
Windows
Unixes
Check Status
Crash recovery
Datadir change
Global configuration parameters
searchbox process identity
Program and data location
Pidfile location (unix only)
Number of processes
Platform access
Proxy access
Notifications
Logging
Default User agent
Index optimization
Index flushes
DB Sync level
Internal HTTP Web Services
Enterprise Search Portal
Worker timeout
Minimum disk space
Global crawl control
Handling of <meta name="searchbox-xxxx"> tags in HTML documents
14. Troubleshooting

List of Figures

3.1. searchbox deployment environment
3.2. Windows Installer Step 1
3.3. Windows Installer Step 2
3.4. Windows Installer Step 3
3.5. Windows Installer Step 4
3.6. Windows Installer Step 5
3.7. Windows Installer Step 6
3.8. Windows Installer Step 7
3.9. Windows Installer Step 8
3.10. Windows Installer Step 9
3.11. Windows Installer Step 10
3.12. The Control Panel icon
3.13. The Control Panel at its first launch
3.14. Mac OS X Installer Step 1
3.15. Mac OS X Installer Step 2
3.16. Mac OS X Installer Step 3
3.17. Mac OS X Installer Step 4
3.18. Mac OS X Installer Step 5
3.19. Mac OS X Installer Step 6
3.20. The Control Panel icon
3.21. The Control Panel at its first launch
4.1. Breadth-First Crawling scheme
4.2. Depth-First Crawling
5.1. The searchbox functional schema
5.2. The searchbox range of action
5.3. The three-dimensional structure of a searchbox Digital Source: the (x,y) space and its time evolution (t)
5.4. The Collection as a union of many Archives
5.5. The Watch for Content Monitoring
5.6. How searchbox concepts are connected to each other
5.7. The Content Gathering component as a bridge between content providers and client applications
5.8. Fetched documents
5.9. Rendered documents
5.10. Documents retrieval
5.11. The rendering pipeline
5.12. A document as a composition of layers
6.1. The Control Panel application
6.2. The initial Preference window
6.3. Adding a new Engine endpoint
6.4. Fast Engine switch
7.1. The User and Security menu item
7.2. Available tabs of the security window
7.3. New user window
7.4. New group window
7.5. The Root ACL access window
7.6. The ACL Tab
8.1. The Sources Tab
8.2. A new unconfigured Source
8.3. Setting Name and Description of a new Source
8.4. The Seed box
8.5. The Seed window
8.6. Web site Seed
8.7. FTP site Seed
8.8. Gopher site Seed
8.9. Usenet site Seed
8.10. Filesystem Seed
8.11. Mailbox Seed
8.12. WebDav share Seed
8.13. SMB share Seed
8.14. ODBC Seed
8.15. ODBC Seed: no placeholder warning
8.16. ODBC Seed: enabling the odbc plugin
8.17. ODBC Seed: no plugin warning
8.18. "other" Seed
8.19. The list of active Seeds
8.20. No configured filters warning
8.21. Source configured with default inclusion filters
8.22. Depth Limit
8.23. Depth Limit parameter
8.24. robots.txt parameter
8.25. Has side-by-side metadata option
8.26. Authentication methods
8.27. Cookie authentication details
8.28. "Exclude text between" option
8.29. The Plugin status panel
8.30. The new plugin window
8.31. A new odbc type plugin
8.32. The plugin configuration panel
8.33. A freshly configured plugin
8.34. The Archive Tab
8.35. List of active sources
8.36. A new unconfigured Archive
8.37. Gathering info
8.38. Start/Stop Gathering buttons
8.39. The reprocessing button
8.40. The Reset button
8.41. The View gathering logs button
8.42. The Gathering Logs window
8.43. Gathering Logs export
8.44. Gathering limits configuration
8.45. Gathering periodicity: hourly
8.46. Gathering periodicity: daily
8.47. Gathering periodicity: weekly
8.48. Gathering periodicity: monthly
8.49. The Garbage Collector configuration
8.50. Browsing Archive content
8.51. Query Weights
8.52. The Context Box
8.53. Show document buttons
8.54. Authentication for searchbox cache
8.55. Cached document address
8.56. Text view of FFF
8.57. Metadata view of FFF
8.58. Raw view of FFF
8.59. Add and Remove buttons
8.60. Manual add parameters
8.61. Copy ID button
9.1. The Plugins Tab
9.2. The default plugins
9.3. New Plugin creation window
9.4. Inherit Plugin creation window
9.5. Available Rendering plugins
10.1. The Templates Tab
10.2. Popup window for adding a new key:value pair
10.3. The list of configured static Metadata Templates
10.4. Adding/Removing a Metadata Template to/from a single document
11.1. The Collections Tab
11.2. The built-in Collection
11.3. A new Collection
11.4. Browsing Collection content
11.5. The Watches Tab
11.6. Choose the Collection to use with a Watch
11.7. The Info Tab of Watch configuration panel
11.8. Watch notification window
11.9. SOAP Notification detail level
11.10. Notification detail level
11.11. The Notificator Recipient list
11.12. Watch Browsing
11.13. The Copy RSS button
11.14. The Mozilla Thunderbird RSS configuration
11.15. The Enterprise Search Portal
13.1. The Server status window of Control Panel

List of Tables

7.1. Root ACL schema
8.1. Sliceids
12.1. PluginDLLs, plugins and docfilters
12.2. Languages supported by the SLR pluginDLL
12.3. Fields for static metadata DocFilter configuration
12.4. Fields for regular expressions metadata DocFilter configuration
12.5. odbc URL structure
12.6. ODBC plugin Rules line fields
12.7. ODBC plugin Rules line types
12.8. ODBC plugin conversion values

List of Examples

11.1. A query containing a single word
11.2. Some queries using wildcards
11.3. Some complex wildcard queries
11.4. Querying for specific metadata
11.5. Selecting documents based on a range of metadata values
11.6. Phrase search
11.7. Phrase search with sloppiness
11.8. Some simple boolean queries
11.9. Using parentheses
11.10. Using special characters in the query
11.11. Querying tokenized and normalized metadata