4

I'm using MySQL 5.6 and my storage engine is InnoDB.

I have a table with 1 million rows containing with the columns:

  • ID (primary key)
  • FirstName
  • LastName
  • foreign_key_id (foreign key, NOT NULL)
  • foreign_key_id2 (another foreign key, default NULL)

The rows are seperated under:

  • 25% with foreign_key_id value 1 and foreign_key_id2 NULL
  • 25% with foreign_key_id value 1 and foreign_key_id2 NOT NULL
  • 25% with foreign_key_id value 2 and foreign_key_id2 NULL
  • 25% with foreign_key_id value 2 and foreign_key_id2 NOT NULL

With the following indexes:

  • index foreign_key_idx on foreign_key_id
  • index foreign_key_2_idx on and foreign_key_id2
  • composite index foreign_key_comp_idx on (foreign_key_idx, foreign_key_2_idx)

I perform the following queries:

Query 1 - without indexes:

SELECT *
FROM table tbl
IGNORE INDEX(foreign_key_idx, foreign_key_2_idx, foreign_key_comp_idx)
WHERE tbl.foreign_key_id = 1 AND tbl.foreign_key_id2 IS NOT NULL

Query 2 - with indexes (no composite index):

SELECT *
FROM table tbl
IGNORE INDEX(foreign_key_comp_idx)
WHERE tbl.foreign_key_id = 1 AND tbl.foreign_key_id2 IS NOT NULL

Query 3 - with composite index (no other indexes):

SELECT *
FROM table tbl
IGNORE INDEX(foreign_key_idx, foreign_key_2_idx)
WHERE tbl.foreign_key_id = 1 AND tbl.foreign_key_id2 IS NOT NULL

The results:

Query 1 (no indexes) performs a full table scan and uses 1 million records with a total duration of 0.37 seconds.

Query 2 (indexes, no composite index) performs a non-unique key lookup on foreign_key_idx index and uses 500K records with a total duration of 0.6 seconds.

Query 3 (composite index only) performs an index range scan on composite index and uses 480K records with a total duration of 0.13 seconds.

What I really don't understand is: why is query 2 (with indexes) always performing slower than query 1 (without indexes)? I'm really really stuck and need some help...

I've tested the queries above with different amount of rows, like 1k, 10k, 20k, 50k, 100k, 200k, 250k, 500k, 1M etc, always with the same ratio (25%), and the results where the same (query 2 always performing slow)

Thank you in advance, really appreciate any kind of input!

Edit (2 May 2016)

SHOW CREATE TABLE COMMAND:

CREATE TABLE `table` (
   `ID` int(11) NOT NULL AUTO_INCREMENT,
   `FirstName` varchar(255) NOT NULL,
   `LastName` varchar(255) NOT NULL,
   `foreign_key_id` int(11) NOT NULL,
   `foreign_key_id2` int(11) DEFAULT NULL,

   PRIMARY KEY (`ID`),
   KEY `foreign_key_idx` (`foreign_key_id`),
   KEY `foreign_key_2_idx` (`foreign_key_id2`),
   KEY `foreign_key_comp_idx ` (`foreign_key_id`,`foreign_key_id2`),

   CONSTRAINT `foreign_key_idx` FOREIGN KEY (`foreign_key_id`) REFERENCES `table2` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION,
   CONSTRAINT `foreign_key_2_idx` FOREIGN KEY (`foreign_key_id2`) REFERENCES `table3` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION,
 ) ENGINE=InnoDB AUTO_INCREMENT=1515998 DEFAULT CHARSET=latin1

EXPLAIN PLANS: enter image description here

Not sure if important, but table2 has 20 records and table3 also 1 million.

5
  • Use EXPLAIN and add the plans for all the queries- among other things it will show estimated number of rows to check. How are the 25% sets of rows distributed with the relation to the id? is 25% of the lowest IDs all the same values of the FKs? Or are they inserted mixed?
    – jkavalik
    Commented May 1, 2016 at 19:51
  • @jkavalik Edited!
    – TheOddGuy
    Commented May 2, 2016 at 13:14
  • Next time please just add the textual output of the explain, imho most people are better used to it :) (and it actually contains more data, the Extra column for example can show some optimizations being used)
    – jkavalik
    Commented May 2, 2016 at 14:55
  • Just to check - the queries return some 250K rows from 1M?
    – jkavalik
    Commented May 2, 2016 at 15:00
  • @jkavalik Yes they return 250K rows from 1M.
    – TheOddGuy
    Commented May 2, 2016 at 15:09

2 Answers 2

3

My understanding: The table contains 1M rows of which 250k are returned by the query. There are 500k rows with foreign_key_id = 1 and 500k rows with af.foreign_key_id2 IS NOT NULL.

The query using full table scan (actually doing full index scan on the PRIMARY key in InnoDB) will read all 1M rows sequentially and check each of them for the conditions.

The query using foreign_key_idx (and it should be the same if it used 'foreign_key_2_idx') has to read 500k rows by "random" order (supposing the rows were inserted or assigned the ID randomly) and check the other condition on them. That means the query reads 500k rows by their primary key from the table - but that means it will probably access 100% of data pages of the table. So in the end the query reads half of the index and all of the table - more data to go through in total and the table is accessed by random seeks, not sequentialy. It should be no surprise that such query is slower than full table scan.

The query going by foreign_key_comp_idx will find the part of the index for foreign_key_id = 1 by ref access method and then fetch its part satisfying af.foreign_key_id2 IS NOT NULL by range access - finding 250k rows which it will then read from the table. Again there is a high probability that it will have to read more than 50% of the table pages but it may just be lucky with the data distribution that it is less than 100% and as a plus it does not have to check the conditions again as they are assured by the index.

What surprises me: usual understanding is that the optimizer will not even want to use an index if it will lead to fetching ~20% of the table rows, prefering full table scan instead as it is usually faster. Your IGNORE INDEX() hints should not change that, thats what FORCE INDEX() hint is for (telling optimizer that the table scan is extremely costly compared to the suggested index).

But maybe if the insert order was not totally random, the statistics and index dives show that (even when the expected rows numbers do not).

1

I know I am late to the party here.

The Cardinality of the Foreign Keys is 2, meaning you only have 2 options being indexed - not very useful. If you had 2000 possibilities for those FK's it would be useful to index.

Every insert, update is requiring an update to the index and in this case is not necessary - every query has an extra step - again not necessary .

Think of Indexing like grouping and sorting, a pointer if you will .. if you only have two options you are grouped in large groups and not sorted much.

I can not say that this is how it is working in the db engine but this is how I like to think about indexes at this point.

If you had 2,000 possible FK values - this is where the advantage would be as the possible shrinkage (segmentation of data you are looking at could be 500 - rather than 500,000).

That is a simplified explanation - but you will notice indexes get updated on inserts (they work really great on sequences especially when you don't insert in the middle of the sequence but at the end ) .

Not the answer you're looking for? Browse other questions tagged or ask your own question.