Why Upgrading ElasticSearch Was Not an Easy Task

ElasticSearch, they say, packs a “ton of goodness into each release”, and if you skip a few tons of goodness, you can end up with the kind of goodness overflow we experienced while upgrading it.
One might say we had a peculiar idea of how to use ElasticSearch mapping types: we used them for everything, including keys in arrays, table names, search, etc.
That was the primary reason the upgrade waited so long. We were stuck on version 5.3.2 and aiming to jump to 7.10.1, and the code depended heavily on mapping types, which ElasticSearch 6.x limits to one per index and 7.x deprecates.
Another problem entirely was the complete removal of custom plugins. One of our features had to be shut down completely because it needed a custom ElasticSearch plugin to work. Luckily, it was never enabled in production, so it was no biggie, right?
To give you a better idea of what I’m talking about, here is a small sample of what our mappings looked like before upgrading ElasticSearch:
mapping_type_1:
  active: {type: byte, index: 'true'}
  additional: {type: integer, index: 'true'}
  . . .
mapping_type_2:
  active: {type: byte, index: 'true'}
  additional: {type: integer, index: 'true'}
  . . .
. . .
mapping_type_36:
  active: {type: byte, index: 'true'}
  additional: {type: integer, index: 'true'}
  . . .
We had four indices in our 5.3.2 cluster, and three of them posed no problem for the upgrade. We even managed to remove one index completely: it held only around 300 documents, so there was no reason that data could not be retrieved directly from the database.
The one index that remained had 36 mapping types that were same-same but different. At this point, we did what anyone would have done and checked the official ElasticSearch documentation for the recommended procedure. That left us with two options: create a separate index for each mapping type, or combine all the fields into one mapping in a single index.
We went with the second option, combining all the fields in one mapping. That gave us one index with a lot, and I mean a LOT, of fields. But it was still better than the other option, creating 36 different indices with almost identical mappings. Another argument for the “one ultimate mapping” option was that, with 36 indices, we would have had to search across all of them without losing any performance.
Good. We had a course of action. What now?
Let’s summarize the situation:
We started the great cleanup / refactor / rewrite session to merge all those numerous dynamic mapping files into one file which would then be combined with static mappings. The mapping types were removed in this step, and the mapping type name was added as a new field to the static mappings. That way we didn’t have to rewrite the entire application and we could use ElasticSearch 7.10.1. The new static mappings file ended up looking something like this:
_doc:
  class: {type: text, index: 'true'}
  active: {type: byte, index: 'true'}
  additional: {type: integer, index: 'true'}
  . . .
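To make the merge step a bit more concrete, here is a minimal Python sketch of collapsing per-type mappings into one typeless mapping. It is not our actual tooling (that lives in the application code); the function name `merge_mapping_types`, the `extra` field and the conflict check are assumptions made up for illustration. The only things taken from the mappings above are the `class`, `active` and `additional` fields.

```python
def merge_mapping_types(typed_mappings, type_field="class"):
    """Collapse {type_name: {field: definition}} into one typeless mapping.

    Hypothetical helper: the former mapping type name is meant to be stored
    per document in `type_field`, so it gets its own field definition here.
    """
    merged = {type_field: {"type": "text", "index": True}}
    for type_name, fields in typed_mappings.items():
        for field, definition in fields.items():
            existing = merged.get(field)
            # The same field name may only have one definition in a single index.
            if existing is not None and existing != definition:
                raise ValueError(
                    f"conflicting definitions for {field!r} (seen again in {type_name!r})"
                )
            merged[field] = definition
    return {"properties": merged}


# Two made-up types that are "same-same but different".
old_mappings = {
    "mapping_type_1": {"active": {"type": "byte"}, "additional": {"type": "integer"}},
    "mapping_type_2": {"active": {"type": "byte"}, "extra": {"type": "keyword"}},
}
merged = merge_mapping_types(old_mappings)
print(sorted(merged["properties"]))
```

Failing loudly on conflicting definitions is a design choice of this sketch: two types that disagree on a field's data type cannot share one index without renaming one of the fields.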
This “easy” part was followed by the removal of dependencies on mapping types across the entire code base. Hours turned to days, days to weeks, and a few weeks later we finally managed to refactor all the places that fetched mapping types from elastic and did magic with them.
Indexing documents, creating and manipulating indices in any way was a whole procedure that required a hefty multi-step document. It seemed as good a time as any to refactor it.
Instead of a three-page procedure, we now had five console commands: Create, Delete, Index, Replay and AddToQueue, all of which use ruflin/elastica to communicate with the ElasticSearch cluster in the background.
The update queue is just one table in the database where the ID of the changed document and the name of the index are stored. Once the queue is enabled, any changes that go to the ElasticSearch index with the write alias are also recorded to the queue.
The AddToQueue command is intended to be used to easily add one or more IDs to the update queue table. This could be useful if for some reason some documents aren’t in sync with the database.
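For illustration, the update queue and an AddToQueue-style helper could look roughly like the Python sketch below. It uses an in-memory SQLite database so it runs anywhere; the table name `update_queue`, the column names and the index name `products` are assumptions, not our real schema.

```python
import sqlite3


def create_queue(conn):
    # One table: which document changed, and in which index it lives.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS update_queue ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "document_id TEXT NOT NULL, "
        "index_name TEXT NOT NULL)"
    )


def add_to_queue(conn, index_name, document_ids):
    # Manually enqueue IDs, e.g. for documents that drifted out of sync.
    conn.executemany(
        "INSERT INTO update_queue (document_id, index_name) VALUES (?, ?)",
        [(str(doc_id), index_name) for doc_id in document_ids],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
create_queue(conn)
add_to_queue(conn, "products", [101, 102])  # "products" is a made-up index name
rows = conn.execute(
    "SELECT document_id, index_name FROM update_queue ORDER BY id"
).fetchall()
print(rows)
```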
The Replay command then takes chunks of IDs from the update queue and bulk upserts (inserts or updates) that data into the appropriate index that has the write alias. Once the documents are updated or inserted, the records are simply deleted from the update queue table.
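A rough Python sketch of the replay loop follows. The real command talks to the cluster through ruflin/elastica; here the transport is a plain callback so the example runs without a cluster. The helper names (`chunked`, `bulk_upsert_body`, `replay`), the list-based queue and the sample data are all assumptions. The request body itself follows the ElasticSearch bulk API format: an `update` action line followed by a `doc` line with `doc_as_upsert` set.

```python
import json


def chunked(items, size):
    # Yield consecutive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bulk_upsert_body(index_name, documents):
    # Build the newline-delimited JSON body expected by the _bulk endpoint.
    lines = []
    for doc_id, source in documents.items():
        lines.append(json.dumps({"update": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps({"doc": source, "doc_as_upsert": True}))
    return "\n".join(lines) + "\n"


def replay(queue, index_name, load_documents, send_bulk, chunk_size=500):
    """Upsert queued documents chunk by chunk, then drop them from the queue.

    `queue` is a list of (document_id, index_name) tuples standing in for the
    queue table; `load_documents` and `send_bulk` are hypothetical callbacks.
    """
    pending = [doc_id for doc_id, name in queue if name == index_name]
    for chunk in chunked(pending, chunk_size):
        send_bulk(bulk_upsert_body(index_name, load_documents(chunk)))
        done = set(chunk)
        # Only remove records after the bulk request for them was sent.
        queue[:] = [(d, n) for d, n in queue if not (n == index_name and d in done)]


queue = [("1", "products"), ("2", "products"), ("3", "products")]
database = {"1": {"active": 1}, "2": {"active": 0}, "3": {"active": 1}}
sent = []  # collects request bodies instead of calling a cluster
replay(queue, "products", lambda ids: {i: database[i] for i in ids}, sent.append, chunk_size=2)
print(len(sent), queue)
```

A real implementation would also have to inspect the bulk response for per-item errors before deleting queue records; the sketch leaves that out.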
The Index command creates a new index with a write_new alias, enables syncing changes to the queue and bulk inserts data from the database to the index. After all documents are inserted, the write alias is switched to the new index, the update queue is replayed via the Replay command, the read alias is switched to the new index and the old one is deleted. And voila, indexing with zero downtime!
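The alias switch is what makes the zero-downtime part work, so here is a small Python sketch of it. It only builds the request body for the ElasticSearch `_aliases` endpoint, where `remove` and `add` actions in one request are applied atomically. The function name `switch_alias`, the alias names `write` and `read`, and the index names `items_v1` and `items_v2` are assumptions for illustration.

```python
def switch_alias(alias, old_index, new_index):
    # Body for POST /_aliases: move `alias` from the old index to the new one.
    actions = []
    if old_index is not None:
        actions.append({"remove": {"index": old_index, "alias": alias}})
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}


# Order of operations as described above:
# 1. bulk insert into items_v2 while changes are recorded to the queue
# 2. switch the write alias
# 3. replay the update queue
# 4. switch the read alias, then delete items_v1
write_switch = switch_alias("write", "items_v1", "items_v2")
read_switch = switch_alias("read", "items_v1", "items_v2")
print(write_switch)
```

Because both actions travel in one request, there is no moment at which the alias points to no index at all.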
How are we going to deploy this huge change in a way that everything works? Once again, to the documentation! This left us with several possibilities:
Since we wanted to upgrade without downtime, we went with the second option → reindex from a remote cluster. For this to happen, we had to run two clusters in parallel: the old 5.3.2 cluster and the new 7.10.1 cluster.
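A remote reindex boils down to one request against the new cluster that names the old cluster as its source. The Python sketch below only builds that request body for the `_reindex` endpoint; the host URL and the index names are placeholders, and `remote_reindex_body` is a made-up helper name. Note that the new cluster also has to allow the old host via the `reindex.remote.whitelist` setting in its `elasticsearch.yml`.

```python
def remote_reindex_body(remote_host, source_index, dest_index, batch_size=1000):
    # Body for POST /_reindex on the NEW cluster, pulling from the OLD one.
    return {
        "source": {
            "remote": {"host": remote_host},
            "index": source_index,
            "size": batch_size,  # documents fetched per batch
        },
        "dest": {"index": dest_index},
    }


# Placeholder host and index names, not our real ones.
body = remote_reindex_body("http://old-cluster.example:9200", "items", "items_v2")
print(body["source"]["remote"]["host"])
```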
We deployed the code overnight, when traffic on the site is at its lowest. To guide you through our deploy process, I will list the deploy actions.
Nothing went wrong. Mission accomplished!