Benchmarks

searchbox 2.0 is also a high-end system in terms of performance.

The following benchmarks were run on well-known document collections that are often cited in papers submitted to international conferences and journals on information retrieval. The hardware configuration used for the tests was: Apple PowerMac G5, single 1.8 GHz CPU, 1 GB RAM, 250 GB SATA hard disk, Mac OS X 10.3.

CiteSeer (March 2004 snapshot)

CiteSeer, formerly known as ResearchIndex, is a public specialty search engine and digital library created by researchers Steve Lawrence, Kurt Bollacker and Lee Giles while they were at the NEC Research Institute (now NEC Labs), Princeton, NJ, USA. CiteSeer crawls for and harvests academic scientific documents and uses autonomous citation indexing to permit querying by citation or by document. It is currently hosted on the Web at the School of Information Sciences and Technology, The Pennsylvania State University, and has over 700,000 documents, primarily in the fields of computer and information science and engineering. Every document in the CiteSeer archive is available in more than one format (PS, PDF, etc.); for our test, only the PDF version of each document was considered.

Parameter                                      Value
Document format                                PDF
Total number of documents                      404,517
Total number of correctly processed documents  394,498
Total size of documents                        106.25 GB
Total indexing time                            90.5 hours
Indexing speed (docs/sec)                      1.2
Indexing speed (GB/hour)                       1.17
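
The two speed figures above follow directly from the totals in the same table. As a minimal sketch of that arithmetic (the function name indexing_throughput is hypothetical and is not part of searchbox), the calculation could look like this in Python:

```python
def indexing_throughput(num_docs, total_size_gb, total_hours):
    """Return (documents per second, GB per hour) for one indexing run."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    docs_per_sec = num_docs / (total_hours * 3600)  # 3600 seconds per hour
    gb_per_hour = total_size_gb / total_hours
    return docs_per_sec, gb_per_hour


# CiteSeer totals taken from the table above.
docs_per_sec, gb_per_hour = indexing_throughput(404_517, 106.25, 90.5)
print(f"{docs_per_sec:.2f} docs/sec, {gb_per_hour:.2f} GB/hour")
# prints "1.24 docs/sec, 1.17 GB/hour"
```

The sketch divides by the total number of documents, which appears to be what the table does; dividing by the number of correctly processed documents instead would give a slightly lower docs/sec figure.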

WT10g (March 2000)

The 1.69 million document WT10g collection is proposed as a multi-purpose testbed for experiments in distributed IR, hyperlink algorithms, and conventional ad hoc retrieval. WT10g was constructed by selecting from a superset of documents in such a way that desirable corpus properties were preserved or optimized. These properties include a high degree of inter-server connectivity, integrity of server holdings, inclusion of documents related to a very wide spread of likely queries, and a realistic distribution of server holding sizes.

Parameter                                      Value
Document format                                HTML
Total number of documents                      1,282,276
Total number of correctly processed documents  1,282,176
Total size of documents                        13 GB
Total indexing time                            36 hours
Indexing speed (docs/sec)                      9.8
Indexing speed (GB/hour)                       0.36
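
The two runs rank differently depending on the metric: WT10g is faster in docs/sec, CiteSeer in GB/hour. One way to read this is to derive the average document size and the share of correctly processed documents from the tables above. The sketch below does so under two assumptions that the source does not state: that 1 GB means 10^9 bytes and 1 KB means 10^3 bytes. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    name: str
    total_docs: int
    processed_docs: int
    size_gb: float

    def success_rate(self) -> float:
        """Fraction of documents that were processed correctly."""
        return self.processed_docs / self.total_docs

    def avg_doc_kb(self) -> float:
        """Average document size in KB (assumes 1 GB = 1,000,000 KB)."""
        return self.size_gb * 1_000_000 / self.total_docs


# Figures copied from the two benchmark tables.
runs = [
    BenchmarkRun("CiteSeer", 404_517, 394_498, 106.25),
    BenchmarkRun("WT10g", 1_282_276, 1_282_176, 13.0),
]

for run in runs:
    print(f"{run.name}: {run.success_rate():.2%} processed, "
          f"about {run.avg_doc_kb():.0f} KB per document")
```

Under these assumptions the CiteSeer PDFs average a few hundred KB each while the WT10g HTML pages average roughly 10 KB, which is consistent with the PDF run handling fewer documents per second but more data per hour.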