Xapian and Omega

Xapian is an Open Source Probabilistic Information Retrieval library, released under the GPL.

Xapian and Omega Ranking & Summary

  • License: GPL
  • Price: FREE
  • Publisher Name: Xapian Team
  • Publisher web site: http://www.xapian.org/

Xapian and Omega Description

Xapian is an Open Source Probabilistic Information Retrieval library, released under the GPL. Xapian is written in C++, with bindings to allow use from other languages (Perl, Java, Python, PHP, and TCL are currently supported; Guile and C# are being worked on).

Xapian is designed to be a highly adaptable toolkit that lets developers easily add advanced indexing and search facilities to their own applications.

If you're after a packaged search engine for your website, you should take a look at Omega, an application we supply built upon Xapian. Unlike most other website search solutions, Xapian's versatility allows you to extend Omega to meet your needs as they grow.

Here are some key features of "Xapian and Omega":

· Free Software/Open Source - licensed under the GPL.
· Highly portable - runs on Linux, MacOS X, many other Unix platforms, and Microsoft Windows.
· Written in C++. Perl bindings are available in the module Search::Xapian on CPAN. Java JNI bindings are included in the xapian-bindings module. We also support SWIG, which can generate bindings for 13 languages. At present those for Python, PHP4, and TCL are working. Guile and C# are being worked on.
· Ranked probabilistic search - important words get more weight than unimportant words, so the most relevant documents are more likely to come near the top of the results list.
· Relevance feedback - given one or more documents, Xapian can suggest the most relevant index terms to expand a query, suggest related documents, categorise documents, etc.
· Phrase and proximity searching - users can search for words occurring in an exact phrase or within a specified number of words, either in a specified order or in any order.
· Full range of structured boolean search operators ("stock NOT market", etc). The results of the boolean search are ranked by the probabilistic weights.
Boolean filters can also be applied to restrict a probabilistic search.
· Supports stemming of search terms (e.g. a search for "football" would match documents which mention "footballs" or "footballer"). This helps to find relevant documents which might otherwise be missed. Stemmers are currently included for Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish.
· Supports database files > 2GB - essential for scaling to large document collections.
· Platform independent data formats - you can build a database on one machine and search it on another.
· Allows simultaneous update and searching. New documents become searchable right away.

As well as the library, we supply a number of small example programs, and a larger application - an indexer and CGI-based search application called Omega:

· The indexer supplied can index HTML, PHP, PDF, PostScript, and plain text. Adding support for indexing other formats is easy where conversion filters are available (e.g. Microsoft Word). This indexer works using the filing system, but we also provide a script to allow the htdig web crawler to be hooked in, allowing remote sites to be searched using Omega.
· You can also index data from any SQL or other RDBMS supported by the Perl DBI module. That includes MySQL, PostgreSQL, SQLite, Sybase, MS SQL, LDAP, and ODBC.
· CGI search front-end supplied with highly customisable appearance. This can also be customised to output results in XML or CSV, which is useful if you are dynamically generating pages (e.g. with PHP or mod_perl) and just want raw search results which you can process in your own page layout code.

What's New in This Release:

API:

· Xapian::Document no longer ever stores empty values explicitly. This wasn't intentional behaviour, and how this case was handled wasn't documented. The amended behaviour is consistent with how user metadata is handled.
This change isn't observable using Document::get_value(), but can be noticed when iterating with Document::values_begin(), using Document::values_count(), or trying to delete the value with Document::remove_value().

testsuite:

· Fix testcase scaleweight4 not to fail on x86 when compiled with -O0. The problem was in the testcase code, and was caused by excess precision in intermediate FP values.
· Testcases which check that operations have the expected O(...) behaviour now check CPU time instead of wallclock time on most platforms, which should eliminate occasional failures due to load spikes from other processes. (ticket#308)
· Fix test failures due to SKIP_TEST_FOR_BACKEND("inmemory") not skipping when it should due to comparing char strings with == (on trunk the return value being tested is std::string rather than const char *).
· Improve test coverage in several corner cases.
· Fix testcase consistency2 to actually be run (fortunately it passes).
· In the generated testcases, call get_description() on the default constructed object of each class to make sure that works (and doesn't try to dereference NULL, or fail some assertion, etc). All currently checked classes are fine - this is to avoid future regressions or such problems with new classes.
· In the test coverage build, use "--coverage" instead of "-fprofile-arcs -ftest-coverage".
· The test harness now has the inmemory backend flagged as supporting user-specified metadata (apart from iteration over metadata keys).

matcher:

· If a query contains a MatchAll subquery, check for it before checking the other terms so that the loop which checks how many terms match can exit early if they all match.
· When an OR or AND_MAYBE decayed to an AND, we were carefully swapping the children for maximum efficiency, but the condition was reversed so we were in fact making things worse. This was noticed because it was resulting in the same query running faster when more results were asked for!
· Only build the termname to termfreq and weight map for the first subdatabase instead of rebuilding it for each one. Also don't copy this map to return it. This should speed up searches a little, especially those over multiple databases.
· If a submatcher fails but ErrorHandler tells us to continue without it, we just use a NULL pointer to stand in rather than allocating a special dummy place-holder object.
· Remove AndPostList, in favour of MultiAndPostList. AndPostList was only used as a decay product (by AndMaybePostList and OrPostList), and doesn't appear to be any faster. Removing it reduces CPU cache pressure, and is less code to maintain.
· Call check() instead of skip_to() on the optional branch of AND_MAYBE.

flint backend:

· Fix a bug in TermIterator::skip_to() over metadata keys.

remote backend:

· Fix xapian-tcpsrv --interface option to work on MacOS X (ticket#373).
· Fix typo which caused us to return the docid instead of the maximum weight a document from a remote match could return! This could have led to wrong results when searching multiple databases with the remote backend, but probably usually didn't matter as with BM25 the weights are generally small (often all < 1) while docids are inevitably >= 1.

inmemory backend:

· The inmemory backend doesn't support iterating over metadata keys. Trying to do so used to give an empty iteration, but has now been fixed to throw UnimplementedError (and this limitation has now been documented).

build system:

· Remove a lot of unused header inclusions and some unused code which should make the build faster and slightly smaller.
· Fix to compile under --disable-backend-flint, --disable-backend-remote, and --disable-backend-inmemory.
· Don't remove any built sources in "make clean" even under --enable-maintainer-mode as that breaks switching a tree away from maintainer-mode with: make distclean; ./configure
· configure: Enable more GCC warnings - "-Woverloaded-virtual" for all versions, "-Wstrict-null-sentinel" for 4.0+, "-Wlogical-op -Wmissing-declarations" for 4.3+. Notably "-Wmissing-declarations" caught that consistency2 wasn't being run.
· Internally, fix the few places where we pass std::string by value to pass by const reference instead (except where we need a modifiable copy anyway) as benchmarking shows that const reference is slightly faster and generates less code with GCC's reference counted std::string implementation - with a non-reference counted implementation, const reference should be much faster. (ticket#140)

documentation:

· INSTALL: We no longer regularly test build with GCC 2.95.4 and we're raising the minimum GCC version required to 3.1 for Xapian 1.1.x.
· Document what passing maxitems=0 to Enquire::get_mset() does.
· docs/queryparser.html: Add examples of using a prefix on a phrase or subexpression.
· Correct doxygen comments for user metadata functions: Database::get_metadata() can't throw UnimplementedError but WritableDatabase::set_metadata() can.
· Document that Database::metadata_keys_begin() returns an end iterator if the backend doesn't support metadata.
· HACKING: Update the list of Debian/Ubuntu packages needed for a development environment.

debug code:

· Fix build with --enable-debug.
· Added some more assertions.
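
The ranked probabilistic search and boolean filtering listed among the key features can be illustrated with a small, self-contained Python sketch. This is not Xapian's code or API: it is a toy BM25 scorer (the release notes name BM25 as the weighting scheme) over a made-up four-document corpus. The parameter values K1 and B, the particular idf variant, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Toy corpus; the documents and their ids are invented for this example.
DOCS = {
    1: "stock prices rose as the stock exchange opened",
    2: "the farmers market sells fresh stock for soup",
    3: "the stock market fell sharply",
    4: "the weather was fine and the market was busy",
}

K1, B = 1.2, 0.75  # commonly used BM25 defaults (an assumption here)

TOKENS = {doc_id: text.lower().split() for doc_id, text in DOCS.items()}
AVG_LEN = sum(len(toks) for toks in TOKENS.values()) / len(TOKENS)


def idf(term):
    """Inverse document frequency: rarer terms get a larger weight."""
    n = sum(1 for toks in TOKENS.values() if term in toks)
    return math.log((len(TOKENS) - n + 0.5) / (n + 0.5) + 1.0)


def bm25(doc_id, query_terms):
    """Sum the BM25 contribution of each query term found in the document."""
    toks = TOKENS[doc_id]
    tf = Counter(toks)
    score = 0.0
    for term in query_terms:
        f = tf[term]
        if f == 0:
            continue
        # Length normalisation: long documents need more hits to score well.
        norm = f + K1 * (1 - B + B * len(toks) / AVG_LEN)
        score += idf(term) * f * (K1 + 1) / norm
    return score


def search(query_terms, must_not=()):
    """Rank matching documents; must_not acts as a boolean NOT filter."""
    results = []
    for doc_id, toks in TOKENS.items():
        if any(term in toks for term in must_not):
            continue  # excluded by the boolean filter, never ranked
        score = bm25(doc_id, query_terms)
        if score > 0:
            results.append((doc_id, score))
    return sorted(results, key=lambda r: r[1], reverse=True)
```

Calling search(["stock"]) ranks every document mentioning "stock", while search(["stock"], must_not=["market"]) mirrors the "stock NOT market" example: the filter decides which documents are eligible, and the probabilistic weights decide their order.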

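Phrase and proximity searching, also listed among the key features, relies on knowing where each word occurs in a document. The sketch below shows the general idea with a positional index in plain Python. It is a simplified illustration under stated assumptions, not Xapian's implementation: the function names are hypothetical, and the exact window semantics Xapian uses may differ.

```python
from collections import defaultdict


def build_positions(text):
    """Map each lowercased word to the list of positions where it occurs."""
    positions = defaultdict(list)
    for pos, word in enumerate(text.lower().split()):
        positions[word].append(pos)
    return positions


def phrase_match(positions, words):
    """True if the words occur consecutively, in the given order."""
    if not words:
        return False
    return any(
        all((start + i) in positions.get(w, ()) for i, w in enumerate(words))
        for start in positions.get(words[0], ())
    )


def near_match(positions, a, b, window, ordered=False):
    """True if a and b occur within `window` words of each other.

    With ordered=True, a must come before b; otherwise any order matches.
    """
    for pa in positions.get(a, ()):
        for pb in positions.get(b, ()):
            gap = pb - pa
            if ordered and 0 < gap <= window:
                return True
            if not ordered and 0 < abs(gap) <= window:
                return True
    return False
```

For the text "the quick brown fox jumps over the lazy dog", phrase_match accepts ["quick", "brown", "fox"] but rejects ["brown", "quick"], and near_match finds "fox" within two words of "quick" only when order is not enforced.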

Xapian and Omega Related Software