ElasticSearch Part 5 [Selecting Shards Algorithm]

Let me explain this in the natural language. Lets suppose there are 10 shards across all the network, and we need to write a document in the shard. And you start to think about it like how should I select the shards to which we need to write. Don't worry it can ES can select the shards randomly. Yes you think you are smart? think again what will happen when you want to retrieve the data again do you want to use random again fuck no because it may not get the required data in the random shards we pick.

Don't worry there is a algorithm that is used by the partitioner in the ES which determines the shards where we need to write the document. For this the document id is used.

shard = hash(routing) % number_of_primary_shards

here routing value is the arbitrary string which by default is the doc_id. Lets take a example to understand it. Let us suppose we have 10 shards all across the network and we want to put a user detail in one of the shards. Let user doc_id be 100 and after hashing we get some number like 66 then

shard = 66 % 10 = 6

hence the partitioner will write this document in the shard with index 6 i.e. the 7th one. And also use the same algorithm to retrieve it. In the above process the remainder will always be in the range of 0 to number_of_primary_shards -1.

But there is some drawbacks in the scaling of ES. Users sometimes think that having a fixed number of primary shards makes it difficult to scale out an index later. In reality, there are techniques that make it easy to scale out as and when you need. For this we need to re-index all the data and there are some techniques to do it which we will discuss in later post.

1 comment: