This article was published over 2 years ago. Some information may be outdated.
This post is not finished yet.
I once worked at a company where the application used PostgreSQL for full-text searching. Queries took 30 seconds to return results. Thirty seconds. The database was denormalized, requiring multiple joins just to narrow down results -- and that was before you even attempted full-text search.
The boss refused to consider alternatives. That refusal cost everyone time and sanity. This is exactly the kind of problem Elasticsearch was built to solve.
What is Elasticsearch?
Elasticsearch is a highly scalable, open-source full-text search engine built on Apache Lucene.
You communicate with Elasticsearch over HTTP, sending and receiving JSON. That is the entire interface -- HTTP requests in, JSON responses out.
Key features:
- Highly optimized for search, using analyzers, stemmers, and other text-processing techniques.
- Built on Apache Lucene, an ultra-fast search library, so it inherits Lucene's indexing engine.
- REST API: send JSON requests, get JSON responses.
- Fast and straightforward to work with.
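To make the "HTTP requests in, JSON responses out" interface concrete, here is a minimal Python sketch that uses only the standard library. The helper names build_request and send, and the base URL, are assumptions for illustration. The request is only constructed here, not sent, so no running node is needed to try it.

```python
import json
import urllib.request

# Assumed address of a local node; adjust for your setup.
BASE_URL = "http://localhost:9200"


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request carrying an optional JSON body."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(BASE_URL + path, data=data, method=method)
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


def send(request):
    """Send the request and decode the JSON response (needs a running node)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


example = build_request("GET", "/")
print(example.get_method(), example.full_url)
```

Calling send(build_request("GET", "/")) against a running node should return the same kind of JSON a curl call would.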
Installing Elasticsearch
The official installation guide is here.
If you're running Docker -- as I do -- spin up a container:
docker run -d --name elasticsearch -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" elasticsearch:7.9.0
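Elasticsearch serves its REST API on port 9200; port 9300 is used for communication between nodes. Verify it's running (the sample response below was captured on a 7.8.0 node, so the version fields will differ slightly on the 7.9.0 image):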
curl http://localhost:9200
{
  "name" : "d1da3c443ba0",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "OqUjXWjgRDOln-_RXfNARA",
  "version" : {
    "number" : "7.8.0",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "757314695644ea9a1dc2fecd26d1a43856725e65",
    "build_date" : "2020-06-14T19:35:50.234439Z",
    "build_snapshot" : false,
    "lucene_version" : "8.5.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
Elasticsearch vs RDBMS
You might wonder why you need another database when MySQL and PostgreSQL already support full-text search.
The short answer: Elasticsearch is purpose-built for searching. MySQL and PostgreSQL can search, but not efficiently at scale.
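Elasticsearch is a NoSQL database. It does not have relations, joins, or transactions the way an RDBMS does.
It does offer a limited form of joining through nested fields and the join field type (parent-child relationships), but these are no substitute for relational joins.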
Where Elasticsearch excels:
- Full-text searching across large datasets where relevance ranking matters.
- Autocomplete features.
- Did-you-mean suggestions for misspelled words.
- Searching inside files like PDF, XLS, and PPT.
- Horizontal scaling by default.
Elasticsearch uses text analysis to deliver relevant results. With a synonym filter configured in the analyzer, a search for "father" can also match "dad", "daddy", "papa", and "parent". The filter is built in, but you have to supply the synonym list yourself.
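The idea behind text analysis can be sketched in a few lines of Python. This is a toy illustration, not how Elasticsearch implements it: the SYNONYMS table and the analyze and matches functions are hypothetical names, and a real cluster would do this work inside its analyzers.

```python
import re

# Hypothetical synonym table; a real cluster would configure this in a
# synonym token filter rather than in application code.
SYNONYMS = {
    "dad": "father",
    "daddy": "father",
    "papa": "father",
}


def analyze(text):
    """Lowercase the text, split it into word tokens, and normalize synonyms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [SYNONYMS.get(token, token) for token in tokens]


def matches(query, document):
    """Return True when every analyzed query token appears in the document."""
    document_tokens = set(analyze(document))
    return all(token in document_tokens for token in analyze(query))


print(analyze("My Daddy is here"))          # ['my', 'father', 'is', 'here']
print(matches("father", "Papa came home"))  # True
```

Because both the query and the document pass through the same analysis step, "father" and "Papa" end up as the same token and the match succeeds.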
If you need transactions and joins, you need an RDBMS. In practice, most applications run Elasticsearch alongside an RDBMS. You get the search performance of Elasticsearch and the relational guarantees of your database. The trade-off is that you must keep data synchronized between the two.
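One common way to handle that synchronization is to push rows from the relational database through the bulk API, which accepts newline-delimited JSON: an action line followed by the document, repeated for each row. The sketch below only builds that body. The function name rows_to_bulk_body is hypothetical, and an in-memory SQLite table stands in for the real database.

```python
import json
import sqlite3


def rows_to_bulk_body(rows, index_name):
    """Turn (id, first_name, last_name) rows into a newline-delimited bulk body."""
    lines = []
    for row_id, first_name, last_name in rows:
        # Action line: which index and document ID to write to.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(row_id)}}))
        # Source line: the document itself.
        lines.append(json.dumps({"first_name": first_name, "last_name": last_name}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# In-memory SQLite stands in for the primary relational database.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
connection.execute("INSERT INTO employees VALUES (1, 'Ahmad', 'Iraq')")
rows = connection.execute(
    "SELECT id, first_name, last_name FROM employees ORDER BY id"
).fetchall()

body = rows_to_bulk_body(rows, "employees")
print(body)
```

The resulting body would be sent with POST to the _bulk endpoint using the Content-Type application/x-ndjson. Reusing the database primary key as the document ID means that re-running the sync overwrites documents instead of duplicating them.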
Elasticsearch Concepts
Here are the core terms you need to know:
- Index: similar to a database table. Contains a collection of documents.
- Document: similar to a database record. Any structured JSON data.
- Field: similar to a database column. Fields have types such as text, keyword, long, integer, short, byte, and date.
- Multi-Fields: unlike databases, a single Elasticsearch field can have multiple sub-fields with different data types (covered later).
- Mapping: the process of defining data types for fields.
- Analyzer: breaks text into small searchable tokens.
- Shard: an index can be split into smaller pieces called shards. Shards can be distributed across nodes (machines).
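As a rough illustration of how Field, Multi-Fields, and Mapping fit together, the Python sketch below builds a request body that defines a mapping. The field names are taken from the employee example used in this post. Treating first_name as a text field with a keyword sub-field is an assumption made for illustration, not something the index in this post requires.

```python
import json

# Request body for creating an index with an explicit mapping.
# "birth_date" gets a date type; "first_name" is full-text searchable and
# also exposes an exact-match "keyword" sub-field (a multi-field).
index_body = {
    "mappings": {
        "properties": {
            "first_name": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword"}},
            },
            "birth_date": {"type": "date"},
        }
    }
}

print(json.dumps(index_body, indent=2))
```

Sent as the body of the PUT request that creates an index, this would let you search first_name as analyzed text and refer to first_name.keyword when you need the exact value, for example for sorting.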
Index
An index is a collection of documents.
The term "index" also refers to the act of storing data in Elasticsearch. Throughout this post, I use "indexing" to mean storing data.
Each index contains one or more shards.
You can inspect where Elasticsearch stores indices on disk:
docker exec -it elasticsearch sh
cd /usr/share/elasticsearch/data/nodes/0/indices
Indices live inside the nodes folder because Elasticsearch is scalable by default -- you can add more nodes to the cluster at any time.
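Elasticsearch can only index text-based documents. For PDF, PPT, and XLS files, the text has to be extracted first. In the 7.x series this is done with the Ingest Attachment plugin (based on Apache Tika), which replaced the older Mapper Attachments plugin.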
Document
A document is any structured JSON data.
Sharding
Sharding is essential to understand.
Say you want to index 1 TB of data but your node only has 500 GB of disk space. Add a second node, and Elasticsearch can distribute data across both.
Sharding divides an index into smaller pieces. Each piece is a shard. A 1 TB index can be split into four 250 GB shards spread across four nodes.
Sharding operates at the index level, not at the cluster or node level.
Think of each shard as an independent index: under the hood, every shard is a self-contained Lucene index.
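How does a document end up on a particular shard? Conceptually, a routing key (the document ID by default) is hashed and reduced to a shard number. The Python sketch below shows that idea only. The function name pick_shard is hypothetical, and crc32 is a stand-in: Elasticsearch uses its own hash function and additional routing details that are not reproduced here.

```python
import zlib


def pick_shard(routing_key, number_of_shards):
    """Map a routing key (by default the document ID) to a shard number."""
    # crc32 is a stand-in hash for illustration, not what Elasticsearch uses.
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_shards


document_ids = ["1", "2", "3", "4", "5", "6", "7", "8"]
placement = {doc_id: pick_shard(doc_id, 4) for doc_id in document_ids}
print(placement)
```

The same key always lands on the same shard, which is how a document can be found again later. It also hints at why the shard count matters up front: with a different number of shards, the same key would map somewhere else.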
Creating our first index
Elasticsearch exposes a REST API. You can use any HTTP client: curl, Postman, Insomnia. This post uses curl.
Create an index named "employees":
curl -XPUT 'http://127.0.0.1:9200/employees?pretty'
Appending ?pretty to the URL returns formatted JSON.
{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "employees"
}
Retrieve information about the index:
curl -XGET 'http://127.0.0.1:9200/employees?pretty'
{
  "employees": {
    "aliases": {},
    "mappings": {},
    "settings": {
      "index": {
        "creation_date": "1590765061412",
        "number_of_shards": "1",
        "number_of_replicas": "1",
        "uuid": "RVbipfdBSRG0SSLhLl0Ogg",
        "version": {
          "created": "7070099"
        },
        "provided_name": "employees"
      }
    }
  }
}
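The response above shows the defaults: one shard and one replica. If you wanted different values, you could pass a settings body when creating the index. The Python sketch below builds such a body and reads the counts back out of a response like the one above. Both function names are hypothetical helpers for illustration.

```python
import json


def index_settings_body(shards, replicas):
    """Request body for creating an index with explicit shard settings."""
    return {"settings": {"number_of_shards": shards, "number_of_replicas": replicas}}


def read_shard_counts(index_info, index_name):
    """Pull shard and replica counts out of a GET /<index> response."""
    settings = index_info[index_name]["settings"]["index"]
    # The API returns these numbers as strings, so convert them.
    return int(settings["number_of_shards"]), int(settings["number_of_replicas"])


# Trimmed copy of the response shown above.
sample = {
    "employees": {
        "settings": {"index": {"number_of_shards": "1", "number_of_replicas": "1"}}
    }
}

print(json.dumps(index_settings_body(shards=2, replicas=0)))
print(read_shard_counts(sample, "employees"))  # (1, 1)
```

On a single-node setup like the Docker container used here, a replica has no second node to live on, so setting replicas to 0 is a reasonable choice for local experiments.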
The index is ready. Time to add documents.
Creating documents
Send JSON data to create a document:
curl --location --request POST 'http://localhost:9200/employees/_doc' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "first_name": "Ahmad",
    "last_name": "Iraq",
    "birth_date": "1986-06-11"
  }'
{
  "_index": "employees",
  "_type": "_doc",
  "_id": "u_t_ZXIB2DKxG4ehFP57",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 1,
  "_primary_term": 1
}
Every document needs a unique ID. If you don't provide one, Elasticsearch generates one automatically.
To specify your own ID, append it to the URL:
http://127.0.0.1:9200/employees/_doc/1
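The two ways of indexing a document differ only in the HTTP method and the URL. A small Python sketch of that rule follows. The function name document_endpoint is hypothetical, and the base URL matches the local node used in this post.

```python
from urllib.parse import quote

BASE_URL = "http://127.0.0.1:9200"


def document_endpoint(index_name, doc_id=None):
    """Return the (method, URL) pair for indexing a document."""
    if doc_id is None:
        # No ID given: POST and let Elasticsearch generate one.
        return "POST", f"{BASE_URL}/{index_name}/_doc"
    # Explicit ID: PUT to a URL that ends with the (URL-encoded) ID.
    return "PUT", f"{BASE_URL}/{index_name}/_doc/{quote(str(doc_id), safe='')}"


print(document_endpoint("employees"))     # ('POST', 'http://127.0.0.1:9200/employees/_doc')
print(document_endpoint("employees", 1))  # ('PUT', 'http://127.0.0.1:9200/employees/_doc/1')
```

Indexing again with an ID that already exists replaces the stored document and increments its version, so choose IDs deliberately.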
To retrieve indexed documents, run a search with no query. It matches every document and returns the first 10 hits by default:
curl --location --request GET 'http://localhost:9200/employees/_search'
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "employees",
        "_type": "_doc",
        "_id": "Q_uEZXIB2DKxG4ehof_Y",
        "_score": 1.0,
        "_source": {
          "first_name": "Ahmad",
          "last_name": "Iraq",
          "birth_date": "1986-06-11"
        }
      }
    ]
  }
}
The _source key contains the entire document as you indexed it.
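In application code you usually want just the documents, not the surrounding metadata. The Python sketch below pulls them out of a search response. The function name extract_sources is hypothetical, and the sample is a trimmed copy of the response shown above.

```python
def extract_sources(search_response):
    """Return the original documents from a _search response, keyed by ID."""
    return {hit["_id"]: hit["_source"] for hit in search_response["hits"]["hits"]}


# Trimmed copy of the response shown above.
response = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "hits": [
            {
                "_id": "Q_uEZXIB2DKxG4ehof_Y",
                "_source": {
                    "first_name": "Ahmad",
                    "last_name": "Iraq",
                    "birth_date": "1986-06-11",
                },
            }
        ],
    }
}

for doc_id, source in extract_sources(response).items():
    print(doc_id, source["first_name"], source["last_name"])
```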
Summary
- Elasticsearch is purpose-built for search -- it is not a replacement for your RDBMS, but a complement to it.
- It communicates over HTTP with JSON, making integration straightforward from any language or tool.
- Sharding distributes data across nodes, giving you horizontal scalability out of the box.
- Text analysis is what sets it apart -- features like stemming, synonyms, and analyzers make it far more capable than SQL-based full-text search.
- Run it alongside your RDBMS when you need both search performance and relational guarantees, but plan for data synchronization.