Elasticsearch is a distributed, highly scalable open-source search and analytics engine built on top of Apache Lucene. It’s designed to handle large amounts of data and provide near real-time search and analysis capabilities.
How does Elasticsearch work internally?
Data Storage
Elasticsearch organizes data into a cluster, which consists of one or more nodes, operating in a distributed manner.
Each node is a separate instance of Elasticsearch running on a machine.
Data is organized into logical containers called indices. Each index can be split into smaller units called shards, and each shard is a self-contained Lucene index that resides on a specific node.
Each shard is in turn made up of smaller units called Lucene segments, each of which holds its own inverted index used for search.
Shards are the basis for distributing data across nodes in a cluster for parallel processing.
Data is stored as JSON documents within shards. Each document is a single record whose fields hold its attribute values.
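To make the document-to-shard mapping concrete, here is a minimal sketch of hash-based routing. It is a simplification: Elasticsearch hashes a routing value (the document ID by default) with Murmur3 and also accounts for a routing factor, while this sketch uses `zlib.crc32` purely as a stable stand-in. The function name `route_to_shard` and the document IDs are hypothetical.

```python
import zlib

def route_to_shard(routing_key: str, num_primary_shards: int) -> int:
    """Pick a shard for a document from a stable hash of its routing key."""
    # zlib.crc32 is a stand-in chosen only because it is stable across runs;
    # Elasticsearch itself uses a Murmur3 hash of the routing value.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards

# Hypothetical document IDs spread over an index with 3 primary shards.
doc_ids = [f"doc-{i}" for i in range(10)]
placement = {doc_id: route_to_shard(doc_id, 3) for doc_id in doc_ids}
for doc_id, shard in placement.items():
    print(doc_id, "-> shard", shard)
```

Because the shard is derived from the key and the shard count, the same document always lands on the same shard, which is also why the number of primary shards cannot simply be changed on an existing index.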
Indexing
When you index a document, Elasticsearch analyzes the content of its text fields and stores the result in data structures optimized for search.
During analysis, the text is tokenized, filtered, and normalized to generate searchable terms.
These terms are then stored in an inverted index, which maps terms to the documents containing them.
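The analysis and inverted index steps can be sketched in a few lines of Python. This is a toy model, not Lucene's implementation: the tokenizer is a simple regular expression, the stop word list is a made-up example, and the names `analyze` and `build_inverted_index` are hypothetical.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "the", "is", "of", "and"}  # hypothetical stop list

def analyze(text):
    """Tokenize on non-alphanumerics, lowercase, and drop stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {
    1: "The quick brown fox",
    2: "A quick search engine",
    3: "Brown bears search the forest",
}
index = build_inverted_index(docs)
print(index["quick"])   # [1, 2]
print(index["search"])  # [2, 3]
```

Looking up a term is then a single dictionary access instead of a scan over every document, which is the property that makes full-text search fast.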
Each shard maintains its own inverted index for the documents it holds, so the index as a whole is spread across the cluster.
Documents are written to segments, which are immutable units of data storage. As documents are added or updated, new segments are created; an update writes a new version of the document and marks the old one as deleted. Elasticsearch periodically merges smaller segments into larger ones, which improves search performance and reclaims disk space.
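The segment lifecycle can be illustrated with a small model. This is a sketch under simplifying assumptions: `ToyShard` is a hypothetical class, segments are plain read-only mappings, and `merge` rewrites every segment at once, whereas Lucene merges a few segments at a time according to a merge policy.

```python
from types import MappingProxyType

class ToyShard:
    """Hypothetical model: a shard as a list of read-only segments."""

    def __init__(self):
        self.segments = []    # each segment: read-only mapping doc_id -> source
        self.deleted = set()  # (segment_index, doc_id) pairs marked as deleted

    def flush(self, docs):
        """Write a batch of documents as one new immutable segment."""
        # An update never rewrites an old segment: the old copy is only marked
        # as deleted and the new version goes into the new segment.
        for seg_no, segment in enumerate(self.segments):
            for doc_id in docs:
                if doc_id in segment:
                    self.deleted.add((seg_no, doc_id))
        self.segments.append(MappingProxyType(dict(docs)))

    def merge(self):
        """Rewrite all segments into one, dropping documents marked deleted."""
        merged = {}
        for seg_no, segment in enumerate(self.segments):
            for doc_id, source in segment.items():
                if (seg_no, doc_id) not in self.deleted:
                    merged[doc_id] = source
        self.segments = [MappingProxyType(merged)]
        self.deleted.clear()

shard = ToyShard()
shard.flush({1: "v1 of doc 1", 2: "doc 2"})
shard.flush({1: "v2 of doc 1"})            # update: doc 1 now lives in two segments
print(len(shard.segments), shard.deleted)  # 2 {(0, 1)}
shard.merge()
print(len(shard.segments), dict(shard.segments[0]))
```

After the merge only the latest version of each document remains, which is where the disk space savings come from.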
Distributed Search
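When a search query is executed, the node that receives it acts as the coordinating node and forwards the query to one copy (primary or replica) of every shard of the target indices, not to every shard in the cluster. For full-text queries, the query text is analyzed in the same way as the indexed text so that it matches terms in the inverted index.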
Each shard processes the query locally and returns the relevant results based on its local inverted index.
The coordinating node collects the results from all the shards and merges them to form the final response.
This distributed search approach enables Elasticsearch to parallelize the search workload and scale horizontally.
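The scatter-gather pattern described above can be sketched as follows. The assumptions are deliberate simplifications: shards are queried in a sequential loop rather than in parallel, relevance is just a raw term count, and the names `search_shard` and `coordinate` are hypothetical. Elasticsearch also splits this work into a query phase and a fetch phase, which the sketch does not model.

```python
import heapq

def search_shard(shard_docs, term, size):
    """Score documents on one shard and return its local top hits."""
    # Toy relevance: the number of times the term occurs in the text.
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(size, hits)

def coordinate(shards, term, size):
    """Send the query to every shard, then merge the per-shard top hits."""
    per_shard = [search_shard(docs, term, size) for docs in shards]
    merged = heapq.nlargest(size, (hit for hits in per_shard for hit in hits))
    return [doc_id for _score, doc_id in merged]

shards = [
    {"a1": "search search engine", "a2": "log analytics"},
    {"b1": "distributed search", "b2": "search search search"},
]
print(coordinate(shards, "search", 2))  # ['b2', 'a1']
```

Each shard only needs to return its own top `size` hits, since no document outside a shard's local top `size` can appear in the global top `size`.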
Query Execution
Elasticsearch supports a rich query DSL (Domain-Specific Language) that allows for complex searches.
Queries can involve full-text search, term matching, filters, aggregations, sorting, and more.
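As an illustration, a search request body combining several of these features might look like the following, written here as a Python dictionary that serializes to the JSON the search API expects. The field names (`title`, `status`, `category`, `published_at`) are hypothetical and would need to exist in the index mapping.

```python
import json

# Hypothetical index fields: "title", "status", "category", "published_at".
search_body = {
    "query": {
        "bool": {
            # Full-text clause: analyzed and scored for relevance.
            "must": [{"match": {"title": "distributed search"}}],
            # Filter clause: exact match, does not contribute to the score.
            "filter": [{"term": {"status": "published"}}],
        }
    },
    # Aggregation: count matching documents per category.
    "aggs": {"by_category": {"terms": {"field": "category"}}},
    # Sort newest first instead of by relevance score.
    "sort": [{"published_at": {"order": "desc"}}],
    "size": 10,
}
print(json.dumps(search_body, indent=2))
```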
Elasticsearch uses various data structures and algorithms, including term vectors, postings lists, and caches, to efficiently execute queries.
It relies on Apache Lucene to score and rank documents by relevance.
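To give a feel for relevance scoring, here is a sketch of the textbook BM25 formula, the ranking function Lucene uses as its default similarity. It is not Lucene's code: documents are plain lists of terms, the function name `bm25_score` is hypothetical, and Lucene's implementation differs in details such as how document length is encoded. The values `k1=1.2` and `b=0.75` are the commonly used defaults.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document (a list of terms) against the query with textbook BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        doc_freq = sum(1 for d in corpus if term in d)
        if doc_freq == 0:
            continue  # term appears nowhere, so it contributes nothing
        # Rare terms get a higher inverse document frequency.
        idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
        tf = doc_terms.count(term)
        # Term frequency saturates and is normalized by document length.
        norm = tf + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score

corpus = [
    ["distributed", "search", "engine"],
    ["search", "search", "analytics"],
    ["log", "storage"],
]
scores = [bm25_score(["search"], doc, corpus) for doc in corpus]
print(scores)
```

The second document mentions the term twice at the same length as the first, so it scores higher, while the third does not match and scores zero.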
Cluster Coordination and Resilience
Elasticsearch coordinates the cluster through a single master node, which is elected from the cluster’s master-eligible nodes.
The master node is responsible for cluster-wide operations like creating or deleting indices, tracking the health of nodes, and managing shard allocation.
The elected master publishes cluster state updates to the other nodes, so every node works from the same view of the cluster. Elasticsearch does not use a gossip protocol for this.
If a node fails, the master node detects the failure and initiates shard reallocation to maintain high availability.
Data Replication and High Availability
Elasticsearch provides data replication for fault tolerance and high availability.
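Each primary shard can have one or more replica shards (the number is configurable per index and may be zero), and each replica holds a full copy of the primary shard’s data. A replica is never allocated to the same node as its primary.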
Replicas serve as failover copies and can be promoted to primary shards if necessary.
Replication ensures that data is distributed across multiple nodes, allowing Elasticsearch to continue functioning even if some nodes fail.
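The failover behaviour can be sketched with a small allocation table. This is an illustrative model only: the function `fail_node` and the node names are hypothetical, promotion simply picks the first surviving replica, and the follow-up step in which the master allocates fresh replicas on the remaining nodes is not modeled.

```python
def fail_node(allocation, failed_node):
    """Drop copies on the failed node and promote a replica where the primary was lost.

    allocation maps shard_id -> {"primary": node, "replicas": [nodes]}.
    """
    result = {}
    for shard_id, copies in allocation.items():
        primary = copies["primary"]
        replicas = [n for n in copies["replicas"] if n != failed_node]
        if primary == failed_node:
            # Promote the first surviving replica; None means the shard is unavailable.
            primary = replicas.pop(0) if replicas else None
        result[shard_id] = {"primary": primary, "replicas": replicas}
    return result

allocation = {
    0: {"primary": "node-1", "replicas": ["node-2"]},
    1: {"primary": "node-2", "replicas": ["node-3"]},
    2: {"primary": "node-3", "replicas": ["node-1"]},
}
after = fail_node(allocation, "node-2")
for shard_id, copies in after.items():
    print(shard_id, copies)
```

In this example every shard still has a primary after `node-2` is lost, because each primary had its replica on a different node.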
Scalability and Performance
Elasticsearch scales horizontally by adding more nodes to the cluster.
New nodes can join the cluster seamlessly, and Elasticsearch redistributes shards automatically to balance the data load.
By distributing data and queries across multiple nodes, Elasticsearch can handle large volumes of data and support high query throughput.
Additionally, Elasticsearch caches frequently used filter results and search responses, and query clauses that run in filter context skip relevance scoring, which further improves search performance.
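The rebalancing that happens when a node joins can be approximated with a simple greedy loop. This is a sketch that balances on shard count alone; the function `rebalance` and the shard and node names are hypothetical, and Elasticsearch's allocator weighs additional factors that are not modeled here.

```python
def rebalance(assignment):
    """Move shards from the fullest node to the emptiest until counts differ by at most 1.

    assignment maps node name -> list of shard names.
    """
    nodes = {node: list(shards) for node, shards in assignment.items()}
    while True:
        fullest = max(nodes, key=lambda n: len(nodes[n]))
        emptiest = min(nodes, key=lambda n: len(nodes[n]))
        if len(nodes[fullest]) - len(nodes[emptiest]) <= 1:
            return nodes
        nodes[emptiest].append(nodes[fullest].pop())

# node-3 has just joined an existing two-node cluster and holds no shards yet.
cluster = {"node-1": ["s0", "s1", "s2"], "node-2": ["s3", "s4", "s5"], "node-3": []}
balanced = rebalance(cluster)
print({node: len(shards) for node, shards in balanced.items()})
# {'node-1': 2, 'node-2': 2, 'node-3': 2}
```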
This overview provides a glimpse into the internal workings of Elasticsearch, but the system is quite complex and includes many more features and optimizations.