This week (as usual urgently due to need and poor client planning) a colleague migrated a Magento 1.9 to one of our servers. In principle, although the technology to run a Magento 1 is usually straightforward, there are always some verifications, optimizations (many developers still do not know that you can use Redis for cache and sessions) and tests to be done.
After the initial migration, some tests that pages load correctly and client validation that everything is OK, DNS is changed and all the traffic starts to arrive and with it the surprise. The site, when accessing some categories and almost randomly, starts leaving PHP-FPM pool processes blocked and with no response to the client browser.
My colleague contacts me to try to find a solution. The client only wants it to work and their argument and that of their developers is clear and irrefutable, on their old server it worked fine…
So what now?
The server configuration is studied for maximum performance and with the latest version of Debian and with hundreds of clients enjoying a stable, secure and fast environment. But there are often peculiarities.
A quick look at the MySQL processes shows a query in “Sending data” state that by deduction correspond to each of the pages and blocked PHP processes and they remained in that state indefinitely. Obviously something was happening with that query that in the new version of MariaDB was not working as it should (yes, many times MariaDB and MySQL do strange things in recent versions that worked fine in previous ones).
The MariaDB version on the old server was 5.5 and the current one 10.5. To do some testing, I disable all possible optimizations and leave the MariaDB server with the default configuration (like the old one) but nothing. Putting the same version of MariaDB on a recent version of Debian could be possible, either with Docker, systemd-nspawn, a chroot but the VPS itself is a container and does not play well with other container systems inside it. Percona and Oracle’s MySQL do not offer old installable versions for this version of Debian and compiling… it needs to be easy to maintain in the future and also have security support (MariaDB 5.5 is obsolete); this migration is starting to be completely unprofitable.
Before continuing down the path of installing an older MySQL (which also did not guarantee resolving the problem) I try to analyze the query to see if I can get anything clear:
SELECT count(DISTINCT e.entity_id) FROM `sales_flat_order_item` AS `order_items`
INNER JOIN `sales_flat_order` AS `order` ON `order`.entity_id = order_items.order_id AND `order`.state <> 'canceled'
LEFT JOIN `catalog_product_entity` AS `e` ON (e.type_id NOT IN ('grouped', 'configurable', 'bundle')) AND e.entity_id = order_items.product_id AND e.entity_type_id = 4
INNER JOIN `catalog_category_product_index` AS `cat_index` ON cat_index.product_id=e.entity_id AND cat_index.store_id=6 AND cat_index.visibility IN(2, 4) AND cat_index.category_id = '343' WHERE (parent_item_id IS NULL);
I have had in the past fights with complex queries where generally the lack of a key index (incomprehensible that it had not been added by the developer) generated a considerable performance degradation and the mere act of adding it changed the query execution time from more than 1 minute to less than a second, that multiplied with multiple concurrent visits to an e-commerce on certain pages makes a world of difference.
Going back to the query I start analyzing the tables involved, the fields that participate in the JOINs and the related indexes and I see that a possible candidate is missing: order_items.product_id. So I add an index for that column.
ALTER TABLE `sales_flat_order_item` ADD KEY `product_id` (`product_id`);
And problem solved, no more blocking with the query which becomes instantaneous.
This is how a simple index on a table can save your day.
Why the problem and why did it work fine with MariaDB 5.5 but not with 10.5? No idea, there may be some other factor I have not considered or some error in the database import. There is surely an explanation but it would require a lot of time for analysis and testing and the index resolved the problem.