Benchmarks
searchbox 2.0 is also a high-end system in terms of performance.
The following benchmarks were run on well-known document collections that are often cited in papers submitted to international conferences and journals on information retrieval. The hardware used for the tests was an Apple PowerMac G5 with a single 1.8 GHz CPU, 1 GB of RAM and a 250 GB SATA hard disk, running Mac OS X 10.3.
CiteSeer (March 2004 snapshot)
CiteSeer, formerly known as ResearchIndex, is a public specialty search engine and digital library created by researchers Steve Lawrence, Kurt Bollacker and Lee Giles while they were at the NEC Research Institute (now NEC Labs), Princeton, NJ, USA. CiteSeer crawls for and harvests academic scientific documents and uses autonomous citation indexing to permit querying by citation or by document. It is currently hosted on the Web at the School of Information Sciences and Technology, The Pennsylvania State University, and has over 700,000 documents, primarily in the fields of computer and information science and engineering. Every document in the CiteSeer archive is available in more than one format (PS, PDF, etc.); for our test, only the PDF version of each document was considered.
| Parameter | Value |
| Document format | PDF |
| Total number of documents | 404,517 |
| Total number of correctly processed documents | 394,498 |
| Total size of documents | 106.25 GB |
| Indexing total time | 90.5 hours |
| Indexing speed in docs/sec | 1.2 docs/sec |
| Indexing speed in GB/hour | 1.17 GB/hour |
WT10g (March 2000)
The 1.69 million document WT10g collection was proposed as a multi-purpose testbed for experiments in distributed IR, hyperlink algorithms and conventional ad hoc retrieval. WT10g was constructed by selecting from a superset of documents in such a way that desirable corpus properties were preserved or optimized. These properties include a high degree of inter-server connectivity, integrity of server holdings, inclusion of documents related to a very wide spread of likely queries, and a realistic distribution of server holding sizes.
| Parameter | Value |
| Document format | HTML |
| Total number of documents | 1,282,276 |
| Total number of correctly processed documents | 1,282,176 |
| Total size of documents | 13 GB |
| Indexing total time | 36 hours |
| Indexing speed in docs/sec | 9.8 docs/sec |
| Indexing speed in GB/hour | 0.36 GB/hour |
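The speed rows in both tables can be derived from the totals: documents divided by elapsed seconds, and collection size divided by elapsed hours. The sketch below redoes that arithmetic in Python. The `throughput` helper and the `RUNS` dictionary are illustrative names, not part of searchbox, and dividing by the number of correctly processed documents (rather than the total) is an assumption. Because the published indexing times are rounded, the results may differ slightly from the table values in the last digit.

```python
def throughput(docs: int, size_gb: float, hours: float) -> tuple[float, float]:
    """Return (documents per second, gigabytes per hour) for one indexing run."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return docs / (hours * 3600), size_gb / hours


# Figures copied from the two tables above.
# Assumption: the correctly processed document count is the numerator.
RUNS = {
    "CiteSeer": (394_498, 106.25, 90.5),
    "WT10g": (1_282_176, 13.0, 36.0),
}

for name, (docs, size_gb, hours) in RUNS.items():
    docs_per_sec, gb_per_hour = throughput(docs, size_gb, hours)
    print(f"{name}: {docs_per_sec:.2f} docs/sec, {gb_per_hour:.2f} GB/hour")
```

The two measures rank the collections differently: the PDF run processes more bytes per hour, while the HTML run processes more documents per second, which is consistent with the CiteSeer PDFs being much larger on average than the WT10g pages.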
