# Elastic
This wiki page aims to explain how to set up and use Elasticsearch on the Sabe-Online website.
## Requirements
### Server Requirements
First, install Java and Elasticsearch, as explained in the official documentation here. On the server it is probably better to install Elasticsearch via apt-get; locally I used brew and it worked.
The Elasticsearch version used for development is 1.3.0. You can check it with elasticsearch -v:
```
$ elasticsearch -v
Version: 1.3.0, Build: 1265b14/2014-07-23T13:46:36Z, JVM: 1.8.0_40
```
Note: Java Virtual Machine version is also listed
After Elasticsearch is installed and tested, it should respond on port 9200. For example, http://localhost:9200/_stats should return something like:
```
{
  shards: {
    total: 1,
    successful: 1,
    failed: 0
  }
}
```
This means your server is ready to go and you can start working on the code itself, well done!
### Gem Requirements
Add the following gems to your Gemfile:

```ruby
gem 'elasticsearch-model', '~> 0.1.7'
gem 'elasticsearch-rails', '~> 0.1.7'
```
Note: The versions listed above were used for development; change at your own risk!
Run bundle install and you're ready to go!
## Understanding Elasticsearch
### Elastic and Lucene
Elasticsearch is an open-source search engine built on top of Apache Lucene™, a full-text search-engine library. Lucene is arguably the most advanced, high-performance, and fully featured search engine library in existence today—both open source and proprietary.
But Lucene is just a library. To leverage its power, you need to work in Java and to integrate Lucene directly with your application. Worse, you will likely require a degree in information retrieval to understand how it works. Lucene is very complex.
Elasticsearch is also written in Java and uses Lucene internally for all of its indexing and searching, but it aims to make full-text search easy by hiding the complexities of Lucene behind a simple, coherent, RESTful API.
However, Elasticsearch is much more than just Lucene and much more than “just” full-text search. It can also be described as follows:
- A distributed real-time document store where every field is indexed and searchable
- A distributed search engine with real-time analytics
- Capable of scaling to hundreds of servers and petabytes of structured and unstructured data
And it packages up all this functionality into a standalone server that your application can talk to via a simple RESTful API, using a web client from your favorite programming language, or even from the command line.
Of course, Elasticsearch is of little use to us on its own; we also need the elasticsearch-rails integration.
### elasticsearch-rails
This repository contains various Ruby and Rails integrations for Elasticsearch:
- ActiveModel integration with adapters for ActiveRecord and Mongoid
- Repository pattern based persistence layer for Ruby objects
- Active Record pattern based persistence layer for Ruby models
- Enumerable-based wrapper for search results
- ActiveRecord::Relation-based wrapper for returning search results as records
- Convenience model methods such as search, mapping, import, etc
- Rake tasks for importing the data
- Support for Kaminari and WillPaginate pagination
- Integration with Rails' instrumentation framework
- Templates for generating example Rails application
The Elasticsearch client and Ruby API are provided by the elasticsearch-ruby project.
### elasticsearch-model
The elasticsearch-model library builds on top of the elasticsearch library.
It aims to simplify integration of Ruby classes ("models"), commonly found e.g. in Ruby on Rails applications, with the Elasticsearch search and analytics engine.
The library is compatible with Ruby 1.9.3 and higher.
## Sabe-online Implementation
Our implementation was based on several examples across the web, mostly official documentation, but unfortunately, there's not a single "perfect" tutorial to link here.
This will have to be it.
The base is pretty simple; it follows this structure:
```ruby
# In: app/models/concerns/user_search.rb
#
module UserSearch
  extend ActiveSupport::Concern

  included do
    include Elasticsearch::Model
    include Elasticsearch::Model::Callbacks

    after_touch() { ... }

    index_name [...]

    settings do
      # ...
      mapping do
        # ...
      end
    end

    def as_indexed_json
      # ...
    end

    def self.search(query)
      # ...
    end
  end

  # other methods ...
end
```
```ruby
# In: app/models/user.rb
#
class User
  include UserSearch
end
```
A concern is created, and when included it provides the including class with the methods Elasticsearch needs. This keeps the models clear of elastic pollution and makes them easier to manage.
### after_touch
If a related model is touched, it will fire the after_touch callback and run the given block of code. For example, to reindex users when the Authentication model is updated, we provide the following code:
```ruby
# authentication.rb
belongs_to :user, touch: true
```

That will trigger our after_touch callback in user.rb:

```ruby
after_touch() { UsersIndexerWorker.perform_async(:index, self.id) }
```
This allows Rails to move on, since the callback is performed asynchronously by Sidekiq, so the website user does not have to wait for the indexing to succeed. The worker contains something like:
```ruby
class UsersIndexerWorker
  include Sidekiq::Worker
  sidekiq_options queue: 'elasticsearch'

  def perform(operation, record_id)
    case operation.to_s
    when /index/
      record = User.find(record_id)
      record.__elasticsearch__.index_document
    end
  end
end
```
Simple relations can be done this way; more complex models, with relations to multiple indexed models, need a bit more work. For example:
```ruby
class UserUpdatedWorker
  include Sidekiq::Worker
  sidekiq_options queue: 'elasticsearch'

  def perform(changed, user_id)
    @user = User.find(user_id)
    changed.each do |column|
      if (column == "first_name") || (column == "last_name") || (column == "company_id")
        # @user.asked_questions.each(&:touch)
        @asked_question_ids = AskedQuestion.where("user_id = ?", user_id).map(&:id)
        @asked_question_ids.each do |asked_question_id|
          AskedQuestionsIndexerWorker.perform_async(:index, asked_question_id)
        end
        @challenge_answer_ids = ChallengeAnswer.where("user_id = ?", user_id).map(&:id)
        @challenge_answer_ids.each do |challenge_answer_id|
          ChallengeAnswersIndexerWorker.perform_async(:index, challenge_answer_id)
        end
      end
    end
  end
end
```
This reindexes the AskedQuestion and ChallengeAnswer records belonging to that user. Not all User fields are of interest to every related model, and not every related model entry needs to be indexed; this method pretty much covers it all.
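As a side note, the per-column if chain in the worker above can be collapsed into an array intersection. A standalone sketch in plain Ruby (no Rails required; `changed` stands for the array of changed column names that the worker receives, and `needs_reindex?` is a hypothetical helper, not part of the app):

```ruby
# Columns whose changes matter to the related indexed models.
REINDEX_COLUMNS = %w[first_name last_name company_id].freeze

# Returns true when any changed column should trigger reindexing.
# `changed` is an array of column names, as produced by ActiveModel::Dirty.
def needs_reindex?(changed)
  (changed & REINDEX_COLUMNS).any?
end

needs_reindex?(%w[first_name email])  # => true
needs_reindex?(%w[email])             # => false
```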
Of course, this method is called from a callback:

```ruby
after_update :reindex_relations
```

Note that after_update fires inside the database transaction; if you need to wait for the commit to succeed before reindexing, use after_commit on: :update instead. The method itself:

```ruby
def reindex_relations
  UserUpdatedWorker.perform_async(self.changed, self.id) if changed?
end
```
Note: All related models need to be covered. If a relation is indexed together with the user, it has to trigger its users' reindexing in turn.
### index_name
This is very important for testing, not so much for production. Testing with Elasticsearch implies creating and deleting entire indexes, and obviously you don't want to destroy all your development data. In production, unless something is really wrong and you run two environments on the same cluster, you won't need this line:

```ruby
index_name [model_name.collection, Rails.env].join('_')
```

This appends the environment name to the model name, creating indexes like users_development or users_test. This way the names won't collide and your Elasticsearch development data is safe from the evil rspec claws.
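Outside of Rails, the same expression can be sketched with plain strings ('users' stands in for model_name.collection and the environment names for Rails.env; these values are illustrative):

```ruby
# Sketch of how index_name is computed from the collection name and the
# current environment, keeping per-environment indexes separate.
def index_name_for(collection, env)
  [collection, env].join('_')
end

index_name_for('users', 'development')  # => "users_development"
index_name_for('users', 'test')         # => "users_test"
```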
### Settings
This is where you configure almost everything Elasticsearch related. For simplicity's sake, I will extract almost everything to its own section, but keep in mind that most of it goes inside the settings block.
```ruby
# configuration
settings index: { number_of_shards: 1, number_of_replicas: 0 },
         analysis: {
           analyzer: {
             # custom analyzer
             folding: {
               tokenizer: 'standard',
               filter: ["lowercase", "asciifolding"]
             }
           }
         } do
  # ...
end
```
The number_of_shards sets the number of primary Lucene instances allocated for this index. More primary shards means more scalability, but also more work per query. For example, say we have two primary shards and 450 users indexed, roughly 225 on each shard. A paginated (50 per page) ordered query then has to fetch the top 50 matches from each shard, merge everything in order again, and keep the first 50. For the second page, it has to fetch 100 from each shard, merge the 200 results, skip the first 50, respond with the next 50 and discard the remaining 100. I did not dig much into this; I believe tuning it would require extra configuration parameters, but I also believe we won't have enough records to require more than one primary shard for a while.
Note: You cannot change the number of primary shards after the index has been created. To change it, you need to reindex.
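The per-shard cost of deep pagination in that example can be checked with a little arithmetic (two primary shards assumed, as in the 450-user example above):

```ruby
# Each shard must return the top (from + size) hits for a page; the
# coordinating node merges them and keeps only the requested page.
shards = 2
size   = 50
page   = 2

per_shard = page * size          # from + size = 100 hits fetched from each shard
merged    = shards * per_shard   # 200 hits merged by the coordinator
returned  = size                 # only 50 are actually sent back
discarded = merged - returned    # 150 hits fetched and thrown away
```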
The number_of_replicas sets the number of mirror shards, or replicas, for the index. A replica is a copy of the primary shard, and has two purposes:
- Increase failover: a replica shard can be promoted to a primary shard if the primary fails
- Increase performance: get and search requests can be handled by primary or replica shards

By default, each primary shard has one replica, but the number of replicas can be changed dynamically on an existing index. A replica shard will never be started on the same node as its primary shard.
Note: A node is a running instance of Elasticsearch which belongs to a cluster. Multiple nodes can be started on a single server for testing purposes, but usually you should have one node per server. At startup, a node will use unicast (or multicast, if specified) to discover an existing cluster with the same cluster name and will try to join that cluster.
After this initial part, you can define custom analyzers that will be used in your mapping. In this case, I created a custom analyzer (with a z, not an s, really) named folding that uses two filters, lowercase and asciifolding. More information about custom analyzers can be found here.
Note: Custom analyzers are the way to go. The default indexing is of little use on its own; fields need to be indexed according to their specific needs.
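To get a feel for what folding does, here is a plain-Ruby approximation of the lowercase + asciifolding combination. Elasticsearch does this with Lucene token filters; this sketch only illustrates the effect on a single term:

```ruby
# Approximates lowercase + asciifolding: decompose accented characters
# (NFD), strip the combining marks, then lowercase the result.
def fold(term)
  term.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').downcase
end

fold('Sónia')  # => "sonia"
fold('sonia')  # => "sonia"
```

This is why a search for 'sonia' can match the indexed 'Sónia': both sides of the comparison are folded to the same token.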
### mappings
Mappings allow you to tailor the indexing of specific fields that require extra attention. Consider the following code:
```ruby
# mapping to index fields that require special attention
mapping do
  indexes :name, type: 'multi_field' do
    indexes :name, analyzer: 'folding'
    indexes :tokenized, analyzer: 'simple'
    indexes :name_raw, index: :not_analyzed
  end
  indexes :email, type: 'multi_field' do
    indexes :email, analyzer: 'folding'
    indexes :tokenized, analyzer: 'simple'
    indexes :email_raw, index: :not_analyzed
  end
  indexes :authentication do
    indexes :provider, analyzer: 'simple'
  end
  # this is the same as default behaviour
  indexes :created_at, type: 'date', index: :not_analyzed
end
```
Let's take a look at the name field: it's indexed three times. First with the folding analyzer, because I needed the special characters removed, so that, for example, a search for 'sónia' would match 'sonia'; then with the simple analyzer, because I also needed 'sónia' to match 'sónia'; and finally as a raw, not-analyzed field, because Lucene uses an inverted index over analyzed tokens, so I needed a field I could sort on, otherwise it would sort 'Sónia Reis' sometimes by her first name and sometimes by her last.
Also, for some weird reason I cannot explain, in testing I had to manually map created_at, otherwise the field would not be available and the default sorting (by created_at) would crash the tests.
To check the mappings on an index you can point your browser to:
http://localhost:9200/users_development/user/_mapping
Further reading about mappings can be found here.
### as_indexed_json
This method defines the _source field of an Elasticsearch match. Also, if no mapping is defined, Elasticsearch will index these fields with standard analyzers and infer what each field represents from the data it receives; for example, it will quickly learn what created_at is:
```
created_at: {
  type: "date",
  format: "dateOptionalTime"
}
```
Consider the following code:

```ruby
# this is what the search engine will reply with
def as_indexed_json(options={})
  self.as_json(
    methods: [:id_as_string, :name, :roles_count],
    only: [:id_as_string, :name, :email, :created_at, :current_sign_in_at, :roles, :company_id, :roles_count],
    include: {
      authentications: { only: [:provider] }
    }
  )
end
```
Note: authentications is a nested model; it's under include.
As you are probably aware, our users don't have a name; they have first_name and last_name, and a method that joins them. You can pass method names so elasticsearch-rails knows to call them at indexing time. One method that is really helpful here is id_as_string, because it allows me to search for user ids with a simple search form (hint: string vs long).
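The name and id_as_string helpers are not spelled out on this page; a minimal sketch of what they presumably look like (assumed implementations, the real app's versions may differ):

```ruby
# Assumed helper methods passed to as_json via the :methods option above;
# this standalone class only illustrates their likely shape.
class User
  attr_accessor :id, :first_name, :last_name

  # Joins first_name and last_name into the single indexed :name field.
  def name
    [first_name, last_name].compact.join(' ')
  end

  # Exposes the numeric id as a string, so a plain multi_match text
  # query can match it (string vs long).
  def id_as_string
    id.to_s
  end
end

u = User.new
u.id, u.first_name, u.last_name = 42, 'Sónia', 'Reis'
u.name          # => "Sónia Reis"
u.id_as_string  # => "42"
```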
### self.search
This is the method whose return value is converted into the JSON sent to Elasticsearch. It basically generates a hash with nested arrays, carrying all the information we need to get the results we want. The method currently used for users looks like:
```ruby
# this method sets the search parameters, both query and filters
def self.search(params)
  search_info = {
    query: {},
    filter: {}
  }

  if params[:search_text] && params[:search_text] != ""
    query = params[:search_text]
    search_info.merge!({
      query: {
        multi_match: {
          query: query,
          fields: ['id_as_string^15', 'name', 'email', 'authentications.provider', 'raw_email']
        }
      }
    })
  else
    search_info.merge!({
      query: {
        match_all: {}
      }
    })
  end

  # both dates must be present and non-empty, or DateTime.parse would raise
  if params[:start_date] && params[:start_date] != "" &&
     params[:end_date] && params[:end_date] != ""
    start_date = DateTime.parse(params[:start_date])
    end_date = DateTime.parse(params[:end_date]).end_of_day
    search_info[:filter][:bool] ||= {}
    search_info[:filter][:bool][:must] ||= []
    search_info[:filter][:bool][:must] |= [{ range: { created_at: { gte: start_date, lte: end_date } } }]
  end

  if params[:user_type] && params[:user_type] != ""
    qfilter = params[:user_type]
    search_info[:filter][:bool] ||= {}
    search_info[:filter][:bool][:must] ||= []
    search_info[:filter][:bool][:must] |= [{ term: { "roles" => qfilter } }]
    if qfilter == User::ROLES[:user].to_s
      search_info[:filter][:bool][:must] |= [{ term: { "roles_count" => qfilter.length.to_s } }]
    end
  end

  if params[:social_type] && params[:social_type] != ""
    qfilter = params[:social_type]
    search_info[:filter][:bool] ||= {}
    search_info[:filter][:bool][:must] ||= []
    if qfilter == "none"
      search_info[:filter][:bool][:must] |= [{ missing: { "field" => "authentications.provider" } }]
    else
      search_info[:filter][:bool][:must] |= [{ term: { "authentications.provider" => qfilter } }]
    end
  end

  if params[:company_id] && params[:company_id] != ""
    qfilter = params[:company_id]
    search_info[:filter][:bool] ||= {}
    if qfilter == "-1"
      search_info[:filter][:bool][:should] ||= []
      search_info[:filter][:bool][:should] |= [{ missing: { "field" => "company_id" } }]
      search_info[:filter][:bool][:should] |= [{ term: { "company_id" => 0 } }]
    else
      search_info[:filter][:bool][:must] ||= []
      search_info[:filter][:bool][:must] |= [{ term: { "company_id" => qfilter.to_i } }]
    end
  end

  if params[:sort] && params[:sort] != ""
    column = params[:sort].split('-').last
    order = params[:sort].start_with?('-') ? 'desc' : 'asc'
    search_info.merge!(sort: [column => { order: order }])
  else
    search_info.merge!(sort: [created_at: { order: :desc }])
  end

  __elasticsearch__.search(search_info)
end
```
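The ||= / |= pattern running through self.search is what lets the independent param branches add clauses to the same bool filter without clobbering each other (|= is array union, so it also guards against duplicate clauses). A standalone plain-Ruby sketch of the pattern, with hypothetical clause values:

```ruby
# Plain-Ruby sketch of the incremental bool-filter construction used in
# self.search; the clause values here are hypothetical.
search_info = { query: { match_all: {} }, filter: {} }

# One branch adds a date-range clause...
search_info[:filter][:bool] ||= {}         # create :bool on first use
search_info[:filter][:bool][:must] ||= []  # create :must on first use
search_info[:filter][:bool][:must] |= [{ range: { created_at: { gte: '2015-01-01' } } }]

# ...and a later, independent branch adds a term clause. ||= leaves the
# existing clauses intact, and |= appends without duplicating.
search_info[:filter][:bool] ||= {}
search_info[:filter][:bool][:must] ||= []
search_info[:filter][:bool][:must] |= [{ term: { 'roles' => 'admin' } }]

search_info[:filter][:bool][:must].size  # => 2
```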