# Elastic
This wiki page aims to explain how to set up and use Elasticsearch on the Sabe-Online website.
## Requirements
### Server Requirements
First, install Java and Elasticsearch, as explained in the official documentation here. On the server it is probably better to install Elasticsearch via apt-get; locally I used brew and it worked.
The Elasticsearch version used for development is 1.3.0. You can check it with elasticsearch -v:
```
$ elasticsearch -v
Version: 1.3.0, Build: 1265b14/2014-07-23T13:46:36Z, JVM: 1.8.0_40
```
Note: Java Virtual Machine version is also listed
After Elasticsearch is installed and tested, it should respond on port 9200. For example, http://localhost:9200/_stats should return something like:
```
{
  shards: {
    total: 1,
    successful: 1,
    failed: 0
  }
}
```
This means your server is ready to go and you can start working on the code itself, well done!
### Gem Requirements
Add the following gems to your Gemfile:

```ruby
gem 'elasticsearch-model', '~> 0.1.7'
gem 'elasticsearch-rails', '~> 0.1.7'
```
Note: The versions listed above were used for development; change at your own risk!
Run bundle install and you're ready to go!
## Understanding Elasticsearch
### Elastic and Lucene
Elasticsearch is an open-source search engine built on top of Apache Lucene™, a full-text search-engine library. Lucene is arguably the most advanced, high-performance, and fully featured search engine library in existence today—both open source and proprietary.
But Lucene is just a library. To leverage its power, you need to work in Java and to integrate Lucene directly with your application. Worse, you will likely require a degree in information retrieval to understand how it works. Lucene is very complex.
Elasticsearch is also written in Java and uses Lucene internally for all of its indexing and searching, but it aims to make full-text search easy by hiding the complexities of Lucene behind a simple, coherent, RESTful API.
However, Elasticsearch is much more than just Lucene and much more than “just” full-text search. It can also be described as follows:
- A distributed real-time document store where every field is indexed and searchable
- A distributed search engine with real-time analytics
- Capable of scaling to hundreds of servers and petabytes of structured and unstructured data
And it packages up all this functionality into a standalone server that your application can talk to via a simple RESTful API, using a web client from your favorite programming language, or even from the command line.
Of course, Elasticsearch is of little use to us on its own; we also need the elasticsearch-rails integration.
### elasticsearch-rails
This repository contains various Ruby and Rails integrations for Elasticsearch:
- ActiveModel integration with adapters for ActiveRecord and Mongoid
- Repository pattern based persistence layer for Ruby objects
- Active Record pattern based persistence layer for Ruby models
- Enumerable-based wrapper for search results
- ActiveRecord::Relation-based wrapper for returning search results as records
- Convenience model methods such as search, mapping, import, etc
- Rake tasks for importing the data
- Support for Kaminari and WillPaginate pagination
- Integration with Rails' instrumentation framework
- Templates for generating example Rails application
The Elasticsearch client and Ruby API are provided by the elasticsearch-ruby project.
### elasticsearch-model
The elasticsearch-model library builds on top of the elasticsearch library.
It aims to simplify integration of Ruby classes ("models"), commonly found e.g. in Ruby on Rails applications, with the Elasticsearch search and analytics engine.
The library is compatible with Ruby 1.9.3 and higher.
## Sabe-online Implementation
Our implementation was based on several examples across the web, mostly official documentation, but unfortunately, there's not a single "perfect" tutorial to link here.
This will have to be it.
The base is pretty simple; it follows this structure:
```ruby
# In: app/models/concerns/user_search.rb
#
module UserSearch
  extend ActiveSupport::Concern

  included do
    include Elasticsearch::Model
    include Elasticsearch::Model::Callbacks

    after_touch() { ... }

    index_name [...]

    settings do
      # ...
      mapping do
        # ...
      end
    end

    def as_indexed_json
      # ...
    end

    def self.search(query)
      # ...
    end
  end

  # other methods ...
end
```
```ruby
# In: app/models/user.rb
#
class User
  include UserSearch
end
```
A concern is created, and when included it provides the including class with the methods Elasticsearch needs. This keeps the models clear of elastic pollution and makes them easier to manage.
### after_touch
If a related model is touched, it will fire the after_touch callback and run the given block of code. For example, to reindex users when the Authentication model is updated, we provide the following code:
```ruby
# authentication.rb
belongs_to :user, touch: true
```

That will trigger our after_touch callback in user.rb:

```ruby
after_touch() { UsersIndexerWorker.perform_async(:index, self.id) }
```
This allows Rails to move on, since the callback is performed asynchronously by Sidekiq, so the website user does not have to wait for the indexing to succeed. The worker contains something like:
```ruby
class UsersIndexerWorker
  include Sidekiq::Worker
  sidekiq_options queue: 'elasticsearch'

  def perform(operation, record_id)
    case operation.to_s
    when /index/
      record = User.find(record_id)
      record.__elasticsearch__.index_document
    end
  end
end
```
Simple relations can be done this way; more complex models, with relations to multiple indexed models, need a bit more work. For example:
```ruby
class UserUpdatedWorker
  include Sidekiq::Worker
  sidekiq_options queue: 'elasticsearch'

  def perform(changed, user_id)
    @user = User.find(user_id)
    changed.each do |column|
      if (column == "first_name") || (column == "last_name") || (column == "company_id")
        # @user.asked_questions.each(&:touch)
        @asked_question_ids = AskedQuestion.where("user_id = ?", user_id).map(&:id)
        @asked_question_ids.each do |asked_question_id|
          AskedQuestionsIndexerWorker.perform_async(:index, asked_question_id)
        end
        @challenge_answer_ids = ChallengeAnswer.where("user_id = ?", user_id).map(&:id)
        @challenge_answer_ids.each do |challenge_answer_id|
          ChallengeAnswersIndexerWorker.perform_async(:index, challenge_answer_id)
        end
      end
    end
  end
end
```
This reindexes the AskedQuestion and ChallengeAnswer records belonging to that user. Not all User fields are of interest to every related model, and not every related model entry needs to be indexed; this method pretty much covers it all.
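As a side note, the per-column if chain in the worker above can be collapsed into an array intersection. A standalone sketch in plain Ruby (no Rails required; `changed` stands for the array of changed column names that the worker receives, and `needs_reindex?` is a hypothetical helper, not part of the app):

```ruby
# Columns whose changes matter to the related indexed models.
REINDEX_COLUMNS = %w[first_name last_name company_id].freeze

# Returns true when any changed column should trigger reindexing.
# `changed` is an array of column names, as produced by ActiveModel::Dirty.
def needs_reindex?(changed)
  (changed & REINDEX_COLUMNS).any?
end

needs_reindex?(%w[first_name email])  # => true
needs_reindex?(%w[email])             # => false
```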
Of course, this method is called from a callback:

```ruby
after_update :reindex_relations
```

Note that after_update fires inside the database transaction; if you need to wait for the commit to succeed before reindexing, use after_commit on: :update instead. The method itself:

```ruby
def reindex_relations
  UserUpdatedWorker.perform_async(self.changed, self.id) if changed?
end
```
Note: All related models need to be covered. If a relation is indexed together with the user, it has to trigger its users' reindexing in turn.
### index_name
This is very important for testing, not so much for production. Testing with Elasticsearch implies creating and deleting entire indexes, and obviously you don't want to destroy all your development data. In production, unless something is really wrong and you run two environments on the same cluster, you won't need this line:

```ruby
index_name [model_name.collection, Rails.env].join('_')
```

This appends the environment name to the model name, creating indexes like users_development or users_test. This way the names won't collide and your Elasticsearch development data is safe from the evil rspec claws.
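Outside of Rails, the same expression can be sketched with plain strings ('users' stands in for model_name.collection and the environment names for Rails.env; these values are illustrative):

```ruby
# Sketch of how index_name is computed from the collection name and the
# current environment, keeping per-environment indexes separate.
def index_name_for(collection, env)
  [collection, env].join('_')
end

index_name_for('users', 'development')  # => "users_development"
index_name_for('users', 'test')         # => "users_test"
```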
### Settings
This is where you configure almost everything Elasticsearch related. For simplicity's sake, I will extract almost everything to its own section, but keep in mind that most of it goes inside the settings block.
```ruby
# configuration
settings index: { number_of_shards: 1, number_of_replicas: 0 },
         analysis: {
           analyzer: {
             # custom analyzer
             folding: {
               tokenizer: 'standard',
               filter: ["lowercase", "asciifolding"]
             }
           }
         } do
  # ...
end
```
The number_of_shards sets the number of primary Lucene instances allocated for this index. More primary shards means more scalability, but also more work per query. For example, say we have two primary shards and 450 users indexed, roughly 225 on each shard. A paginated (50 per page) ordered query then has to fetch the top 50 matches from each shard, merge everything in order again, and keep the first 50. For the second page, it has to fetch 100 from each shard, merge the 200 results, skip the first 50, respond with the next 50 and discard the remaining 100. I did not dig much into this; I believe tuning it would require extra configuration parameters, but I also believe we won't have enough records to require more than one primary shard for a while.
Note: You cannot change the number of primary shards after the index has been created. To change it, you need to reindex.
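The per-shard cost of deep pagination in that example can be checked with a little arithmetic (two primary shards assumed, as in the 450-user example above):

```ruby
# Each shard must return the top (from + size) hits for a page; the
# coordinating node merges them and keeps only the requested page.
shards = 2
size   = 50
page   = 2

per_shard = page * size          # from + size = 100 hits fetched from each shard
merged    = shards * per_shard   # 200 hits merged by the coordinator
returned  = size                 # only 50 are actually sent back
discarded = merged - returned    # 150 hits fetched and thrown away
```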
The number_of_replicas sets the number of mirror shards, or replicas, for the index. A replica is a copy of the primary shard, and has two purposes:
- Increase failover: a replica shard can be promoted to a primary shard if the primary fails
- Increase performance: get and search requests can be handled by primary or replica shards

By default, each primary shard has one replica, but the number of replicas can be changed dynamically on an existing index. A replica shard will never be started on the same node as its primary shard.
Note: A node is a running instance of Elasticsearch which belongs to a cluster. Multiple nodes can be started on a single server for testing purposes, but usually you should have one node per server. At startup, a node will use unicast (or multicast, if specified) to discover an existing cluster with the same cluster name and will try to join that cluster.
After this initial part, you can define custom analyzers that will be used in your mapping. In this case, I created a custom analyzer (with a z, not an s, really) named folding that uses two filters, lowercase and asciifolding. More information about custom analyzers can be found here.
Note: Custom analyzers are the way to go. The default indexing is of little use on its own; fields need to be indexed according to their specific needs.
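To get a feel for what folding does, here is a plain-Ruby approximation of the lowercase + asciifolding combination. Elasticsearch does this with Lucene token filters; this sketch only illustrates the effect on a single term:

```ruby
# Approximates lowercase + asciifolding: decompose accented characters
# (NFD), strip the combining marks, then lowercase the result.
def fold(term)
  term.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').downcase
end

fold('Sónia')  # => "sonia"
fold('sonia')  # => "sonia"
```

This is why a search for 'sonia' can match the indexed 'Sónia': both sides of the comparison are folded to the same token.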
### mappings
Mappings allow you to tailor the indexing of specific fields that require extra attention. Consider the following code:
```ruby
# mapping to index fields that require special attention
mapping do
  indexes :name, type: 'multi_field' do
    indexes :name, analyzer: 'folding'
    indexes :tokenized, analyzer: 'simple'
    indexes :name_raw, index: :not_analyzed
  end
  indexes :email, type: 'multi_field' do
    indexes :email, analyzer: 'folding'
    indexes :tokenized, analyzer: 'simple'
    indexes :email_raw, index: :not_analyzed
  end
  indexes :authentication do
    indexes :provider, analyzer: 'simple'
  end
  # this is the same as default behaviour
  indexes :created_at, type: 'date', index: :not_analyzed
end
```
Let's take a look at the name field: it's indexed three times. First with the folding analyzer, because I needed the special characters removed, so that, for example, a search for 'sónia' would match 'sonia'; then with the simple analyzer, because I also needed 'sónia' to match 'sónia'; and finally as a raw, not-analyzed field, because Lucene uses an inverted index over analyzed tokens, so I needed a field I could sort on, otherwise it would sort 'Sónia Reis' sometimes by her first name and sometimes by her last.
Also, for some weird reason I cannot explain, in testing I had to manually map created_at, otherwise the field would not be available and the default sorting (by created_at) would crash the tests.
To check the mappings on an index you can point your browser to:
http://localhost:9200/users_development/user/_mapping
Further reading about mappings can be found here.
### as_indexed_json
This method defines the _source field of an Elasticsearch match. Also, if no mapping is defined, Elasticsearch will index these fields with standard analyzers and infer what each field represents from the data it receives; for example, it will quickly learn what created_at is:
```
created_at: {
  type: "date",
  format: "dateOptionalTime"
}
```
Consider the following code:

```ruby
# this is what the search engine will reply with
def as_indexed_json(options={})
  self.as_json(
    methods: [:id_as_string, :name, :roles_count],
    only: [:id_as_string, :name, :email, :created_at, :current_sign_in_at, :roles, :company_id, :roles_count],
    include: {
      authentications: { only: [:provider] }
    }
  )
end
```
Note: authentications is a nested model; it's under include.
As you are probably aware, our users don't have a name; they have first_name and last_name, and a method that joins them. You can pass method names so elasticsearch-rails knows to call them at indexing time. One method that is really helpful here is id_as_string, because it allows me to search for user ids with a simple search form (hint: string vs long).
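The name and id_as_string helpers are not spelled out on this page; a minimal sketch of what they presumably look like (assumed implementations, the real app's versions may differ):

```ruby
# Assumed helper methods passed to as_json via the :methods option above;
# this standalone class only illustrates their likely shape.
class User
  attr_accessor :id, :first_name, :last_name

  # Joins first_name and last_name into the single indexed :name field.
  def name
    [first_name, last_name].compact.join(' ')
  end

  # Exposes the numeric id as a string, so a plain multi_match text
  # query can match it (string vs long).
  def id_as_string
    id.to_s
  end
end

u = User.new
u.id, u.first_name, u.last_name = 42, 'Sónia', 'Reis'
u.name          # => "Sónia Reis"
u.id_as_string  # => "42"
```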
### self.search
This is the method whose return value is converted into the JSON sent to Elasticsearch. It basically generates a hash with nested arrays, carrying all the information we need to get the results we want. The method currently used for users looks like:
```ruby
# this method sets the search parameters, both query and filters
def self.search(params)
  search_info = {
    query: {},
    filter: {}
  }

  if params[:search_text] && params[:search_text] != ""
    query = params[:search_text]
    search_info.merge!({
      query: {
        multi_match: {
          query: query,
          fields: ['id_as_string^15', 'name', 'email', 'authentications.provider', 'raw_email']
        }
      }
    })
  else
    search_info.merge!({
      query: {
        match_all: {}
      }
    })
  end

  # both dates must be present and non-empty, or DateTime.parse would raise
  if params[:start_date] && params[:start_date] != "" &&
     params[:end_date] && params[:end_date] != ""
    start_date = DateTime.parse(params[:start_date])
    end_date = DateTime.parse(params[:end_date]).end_of_day
    search_info[:filter][:bool] ||= {}
    search_info[:filter][:bool][:must] ||= []
    search_info[:filter][:bool][:must] |= [{ range: { created_at: { gte: start_date, lte: end_date } } }]
  end

  if params[:user_type] && params[:user_type] != ""
    qfilter = params[:user_type]
    search_info[:filter][:bool] ||= {}
    search_info[:filter][:bool][:must] ||= []
    search_info[:filter][:bool][:must] |= [{ term: { "roles" => qfilter } }]
    if qfilter == User::ROLES[:user].to_s
      search_info[:filter][:bool][:must] |= [{ term: { "roles_count" => qfilter.length.to_s } }]
    end
  end

  if params[:social_type] && params[:social_type] != ""
    qfilter = params[:social_type]
    search_info[:filter][:bool] ||= {}
    search_info[:filter][:bool][:must] ||= []
    if qfilter == "none"
      search_info[:filter][:bool][:must] |= [{ missing: { "field" => "authentications.provider" } }]
    else
      search_info[:filter][:bool][:must] |= [{ term: { "authentications.provider" => qfilter } }]
    end
  end

  if params[:company_id] && params[:company_id] != ""
    qfilter = params[:company_id]
    search_info[:filter][:bool] ||= {}
    if qfilter == "-1"
      search_info[:filter][:bool][:should] ||= []
      search_info[:filter][:bool][:should] |= [{ missing: { "field" => "company_id" } }]
      search_info[:filter][:bool][:should] |= [{ term: { "company_id" => 0 } }]
    else
      search_info[:filter][:bool][:must] ||= []
      search_info[:filter][:bool][:must] |= [{ term: { "company_id" => qfilter.to_i } }]
    end
  end

  if params[:sort] && params[:sort] != ""
    column = params[:sort].split('-').last
    order = params[:sort].start_with?('-') ? 'desc' : 'asc'
    search_info.merge!(sort: [column => { order: order }])
  else
    search_info.merge!(sort: [created_at: { order: :desc }])
  end

  __elasticsearch__.search(search_info)
end
```
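The ||= / |= pattern running through self.search is what lets the independent param branches add clauses to the same bool filter without clobbering each other (|= is array union, so it also guards against duplicate clauses). A standalone plain-Ruby sketch of the pattern, with hypothetical clause values:

```ruby
# Plain-Ruby sketch of the incremental bool-filter construction used in
# self.search; the clause values here are hypothetical.
search_info = { query: { match_all: {} }, filter: {} }

# One branch adds a date-range clause...
search_info[:filter][:bool] ||= {}         # create :bool on first use
search_info[:filter][:bool][:must] ||= []  # create :must on first use
search_info[:filter][:bool][:must] |= [{ range: { created_at: { gte: '2015-01-01' } } }]

# ...and a later, independent branch adds a term clause. ||= leaves the
# existing clauses intact, and |= appends without duplicating.
search_info[:filter][:bool] ||= {}
search_info[:filter][:bool][:must] ||= []
search_info[:filter][:bool][:must] |= [{ term: { 'roles' => 'admin' } }]

search_info[:filter][:bool][:must].size  # => 2
```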