Sphinx vs Lucene

Over the past few weeks I've been researching how to improve database queries so they scale, with the MySQL sharding approach in mind.

So, we are looking for a stand-alone full-text search server with the following properties:
- Must operate as a stand-alone server that can serve search requests from multiple clients
- Must be able to do "bulk indexing" by indexing the result of an SQL query: say "SELECT id, text_to_index FROM dbtable;"
- Must be free software and must run on Linux with MySQL as the database
- Must be fast (rules out MySQL's internal full-text search)
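
To make the "bulk indexing" requirement concrete, here is a minimal sketch of turning the rows of that SELECT into the newline-delimited body that Elasticsearch's bulk API accepts. The index name "docs" and the helper rows_to_bulk_body are my own illustrative names, and sqlite3 is only a stand-in for MySQL so the example runs anywhere; nothing is actually sent to a server.

```python
import json
import sqlite3


def rows_to_bulk_body(rows, index_name):
    """Turn (id, text) rows into a newline-delimited bulk request body."""
    lines = []
    for doc_id, text in rows:
        # Action line: which index and document id this entry targets.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(doc_id)}}))
        # Source line: the document to index.
        lines.append(json.dumps({"text_to_index": text}))
    # The bulk format expects a newline after the last line as well.
    return "\n".join(lines) + "\n"


# Assumption: an in-memory SQLite table stands in for the MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dbtable (id INTEGER PRIMARY KEY, text_to_index TEXT)")
conn.executemany(
    "INSERT INTO dbtable (id, text_to_index) VALUES (?, ?)",
    [(1, "sphinx indexes sql rows"), (2, "elasticsearch speaks json")],
)
rows = conn.execute("SELECT id, text_to_index FROM dbtable ORDER BY id").fetchall()

body = rows_to_bulk_body(rows, "docs")  # "docs" is a hypothetical index name
print(body, end="")
```

With Sphinx the equivalent step is configuration rather than code: the SQL query is declared in the source definition and the indexer pulls the rows itself.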

The alternatives I've found that have these properties are:
Elasticsearch or Solr (based on Lucene)
Elasticsearch is a distributed, RESTful, free/open source search server based on Apache Lucene. It is developed by Shay Banon and is released under the Apache Software License. Elasticsearch is written in Java.
- it's a nice framework (model) for handling groups of search nodes, built on top of Lucene
- it's a REST-based, distributed search engine powered by the excellent Lucene library; the built-in JSON + HTTP API provides an elegant platform for integration

Sphinx Search Engine
Sphinx is a free software search engine designed with indexing database content in mind. It currently supports MySQL, PostgreSQL, and ODBC-compliant databases as data sources natively.

Similarities

Both Elasticsearch and Sphinx are similar to MySQL FULLTEXT search (think Google).
Both satisfy all requirements. They're fast and designed to index and search large bodies of data efficiently.
Both have a long list of high-traffic sites using them (Elasticsearch, Sphinx)
Both offer commercial support. (Elasticsearch, Sphinx)
Both offer client API bindings for several platforms/languages (Sphinx, Elasticsearch)
Both can be distributed to increase speed and capacity (Sphinx, Elasticsearch)

Here are some differences
Elasticsearch is released under the Apache 2 license. Sphinx is GPLv2. This means that if you ever need to embed or extend (not just "use") Sphinx in a commercial application, you'll have to buy a commercial license.
Elasticsearch is built on top of Lucene, which is a proven technology over 7 years old with a huge user base. Whenever Lucene gets a new feature or speedup, Elasticsearch gets it too. Many of the devs committing to Elasticsearch are also Lucene committers.
Sphinx integrates more tightly with RDBMSs, especially MySQL. 
Elasticsearch can index proprietary formats like Microsoft Word, PDF, etc. Sphinx can't. 
Elasticsearch comes with a spell-checker out of the box. 
Elasticsearch comes with facet support out of the box. Faceting in Sphinx takes more work.
Sphinx doesn't allow partial index updates for field data.
In Sphinx, all document ids must be unique unsigned non-zero integer numbers.
Elasticsearch doesn't even require a unique key for many operations, and unique keys can be either integers or strings. 
Elasticsearch supports field collapsing to avoid duplicating similar results.
Elasticsearch: distributed search - a search can be performed across multiple shards / indexes, and the results are aggregated.
Elasticsearch: distributed indexing - given a set of nodes, each document is routed to one of the shards / indexes. Routing can be round-robin or based on some form of hashing.
Elasticsearch: nodes can be added and removed, and each request is handled by the correct node.
Elasticsearch: each node saves its state in shared storage. When a node comes back after a failure or shutdown, it can recover its state from its replica group.
Sphinx doesn't seem to provide node management and recovery features like these.
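
The distributed indexing and distributed search points above can be sketched in a few lines. This is a toy model under my own assumptions, not Elasticsearch's implementation: shards are plain dicts, CRC32 stands in for whatever hash a real engine uses, and scoring is a bare term count. The names shard_for, index_documents and search are hypothetical.

```python
import heapq
import zlib

NUM_SHARDS = 3  # hypothetical shard count


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Pick a shard by hashing the document id (CRC32 is a stand-in hash)."""
    return zlib.crc32(str(doc_id).encode("utf-8")) % num_shards


def index_documents(docs, num_shards=NUM_SHARDS):
    """Distributed indexing: route each (doc_id, text) pair to exactly one shard."""
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in docs:
        shards[shard_for(doc_id, num_shards)][doc_id] = text
    return shards


def search_shard(shard, term):
    """Score the documents of one shard by how often the term occurs."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term.lower())
        if score:
            hits.append((score, doc_id))
    return hits


def search(shards, term, top_k=2):
    """Distributed search: query every shard, then merge the partial results."""
    partial = [hit for shard in shards for hit in search_shard(shard, term)]
    # Highest score first; ties are broken by the smaller document id.
    return heapq.nsmallest(top_k, partial, key=lambda h: (-h[0], h[1]))


docs = [
    (1, "lucene lucene search"),
    (2, "sphinx search"),
    (3, "lucene index"),
    (4, "mysql sharding"),
]
shards = index_documents(docs)
results = search(shards, "lucene")
print(results)
```

Because the same hash always sends a given id to the same shard, a later update or delete for that document can be routed without asking every node.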

Conclusion
For a simple full-text search setup, Sphinx is the better choice.
If you need to customize your searches, Lucene/Elasticsearch (or Solr) is the better choice. Why?

At its core, a distributed Lucene solution needs to be sharded. Elasticsearch is a solution for exposing an indexing/search server over HTTP. It has a very advanced distributed model, speaks JSON natively, and exposes many advanced search features, all seamlessly expressed through a JSON DSL.
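
To give a feel for that JSON DSL, here is a small sketch that builds a full-text query body. The "match" query and the "size" parameter are part of the Elasticsearch query DSL; the helper build_search_request, the field name text_to_index and the index name "docs" in the comment are just illustrative choices.

```python
import json


def build_search_request(field, text, size=10):
    """Build a full-text query body in the JSON query DSL."""
    return {
        "query": {"match": {field: text}},
        "size": size,  # maximum number of hits to return
    }


body = build_search_request("text_to_index", "distributed search", size=5)

# This JSON would travel in the body of an HTTP search request,
# for example against /docs/_search ("docs" is a hypothetical index name).
print(json.dumps(body, indent=2))
```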

Elasticsearch takes it to the next level with an architecture for creating modern realtime search applications. Percolation is an exciting and innovative feature that singlehandedly blows Sphinx and Solr right out of the water. Elasticsearch is scalable, speedy and a dream to integrate with.
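
If percolation is new to you: it turns search around, so queries are stored first and each incoming document is checked against them. The toy below only illustrates that idea and is not the Elasticsearch API; ToyPercolator, its methods and the whitespace tokenizer are all made up for the example.

```python
def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive analyzer."""
    return set(text.lower().split())


class ToyPercolator:
    """Stores queries up front, then matches incoming documents against them."""

    def __init__(self):
        self._queries = {}

    def register(self, query_id, required_terms):
        # A stored query here is just a set of terms that must all appear.
        self._queries[query_id] = tokenize(required_terms)

    def percolate(self, document_text):
        # Return the ids of every stored query that the document satisfies.
        tokens = tokenize(document_text)
        return sorted(qid for qid, terms in self._queries.items() if terms <= tokens)


p = ToyPercolator()
p.register("alert-sphinx", "sphinx release")
p.register("alert-lucene", "lucene")
matches = p.percolate("New Sphinx release announced")
print(matches)
```

This is the pattern behind alerting and saved-search notifications: the document arrives once and you learn which stored queries it would have matched.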

Elasticsearch is very fast on the query side. It is also very powerful. The indexing side is very CPU- and memory-intensive, an unfortunate side effect of being such a feature-rich, fast application. Nevertheless, I highly recommend Elasticsearch.

Which and why?
Use Elasticsearch if you intend to use it in your web app (for example, as a site search engine). It will definitely turn out to be great, thanks to its API. You will definitely need that power for a web app.
Use Sphinx if you want to search through tons of documents/files really quickly. It indexes really fast too. I would not recommend it for an app that has to handle JSON or parse XML to get the search results. Use it for direct DB searches. It works great with MySQL.
