I was working on an edge case that involved a lot of users, and a lot of people following some users, in the 20M range. It would be equivalent to listing Twitter followers for popular people, such as President Obama. Like most DBAs, I am right up there too.
After looking up the relevant user IDs in a separate user table, the queries I was analyzing took a form something like this:
President: SELECT * FROM UserRel WHERE relatedUserID = 50000001 AND active = 1 AND state = 'approved' AND retirementdate = 0;
DBA: SELECT * FROM UserRel WHERE relatedUserID = 50000007 AND active = 1 AND state = 'approved' AND retirementdate = 0;
This seemed like a great opportunity to show why a SELECT * query can be very bad, as well as to explore index-only scans, a feature new to Postgres 9.2, and where we can leverage them for performance.
Let's say my table of user-follow relationships, UserRel, contains the relationship and user IDs, a relationship type, active/state/retirement columns, and creation/modification metadata.
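Based on the columns referenced throughout this post, a plausible sketch of that table might look like the following. Note that the names and types are my reconstruction from the queries and byte math below, not a confirmed schema:

```sql
-- Hedged reconstruction: columns/types are inferred from the queries
-- and size estimates in this post, not from the original schema.
CREATE TABLE userrel (
    relationshipid   bigint PRIMARY KEY,
    userid           bigint NOT NULL,   -- the follower
    relateduserid    bigint NOT NULL,   -- the user being followed
    relnshiptypeid   int,
    active           int,
    state            varchar(20),       -- e.g. 'approved'
    retirementdate   bigint,            -- 0 = not retired
    creationdate     bigint,
    modificationdate bigint,
    createdby        varchar(20),       -- assumed: "who made the record"
    modifiedby       varchar(20)        -- assumed: "who modified it"
);
```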
If I am working on queries on the application side, in my development environment, it is possible that I have 10 or 10K users in my test table, but I probably don't have 25 million. When I go and query my own followers with
SELECT * FROM UserRel WHERE relatedUserID = 50000007 AND active = 1 AND state = 'approved' AND retirementdate = 0;
I get 97 rows back. Not a big deal for the app to handle, and not a lot of work on the database, especially if I have an index on relatedUserId. In a table that also contains all of the President's followers, it is pretty likely that the query optimizer will use the btree index over a sequential scan of this table. It will return super fast, my app will process and render the data it pulled, and everything will perform great. That will also be true for the more popular DBAs that have broken 100 followers.
Four reasons for not using SELECT * in code
1) It makes you do the work twice.
When you see SELECT * FROM UserRel WHERE relatedUserID = 50000007 AND active = 1 AND state = 'approved' AND retirementdate = 0;
This means YOU are going to have to go and hunt down additional information. Like, what are all the columns included in *? You can't really look at assignments, because they might not be assigning everything. You pretty much have to go to a database and find the data structure you are working with. When the columns are specified, you know exactly what data that query is after, making it easy for you to identify what is wrong or missing. The extra 20 seconds you take typing in column names can save you a lot more time and potential errors when you are coding against multiple application and schema versions.
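When you do have to hunt those columns down, one standard way is to ask the database itself (this uses the stock information_schema.columns view, assuming the table name from this post):

```sql
-- Expand what * would actually return for the userrel table.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'userrel'
ORDER BY ordinal_position;
```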
2) It may make the database do more work.
I am sometimes on the fence on this one because it is not something you would see in an EXPLAIN, but if you SELECT *, the DB has to go figure out what those columns are from the system catalogs. Even when specifying columns, the parse operation has to hit the catalogs anyway to validate the columns you are asking for, so it is somewhat of a trade-off. But let's try running this query a few times on a machine that won't have enough cache to hold it:
SELECT * FROM userrel WHERE relateduserid = '50000001' AND state = 'approved' and retirementdate=0;
Time: 282079.551 ms
Time: 293938.359 ms
Time: 274396.129 ms
Time: 281434.231 ms
Time: 345524.364 ms
SELECT relationshipid, userid, relnshiptypeid, creationdate, modificationdate FROM userrel WHERE relateduserid = '50000001' AND state = 'approved' and retirementdate=0;
Time: 301933.497 ms
Time: 295023.368 ms
Time: 269755.984 ms
Time: 293430.448 ms
Time: 273674.244 ms
In this not-too-scientific example, specifying columns looks like it might be marginally faster on average, but this is not the most important reason.
3) Why are you asking for something you already know? In the case of our query, we already know the relatedUserId, active, state, and retirementdate values, because they are right there in the predicate. So why would we ask Postgres to return those values again? Since you already know them on the app side, you can save the application the extra transfer and assignments. Sometimes this is a savings of a few ms, but if it is part of a method that gets called 100 times a minute, those saved milliseconds add up to saved hours very rapidly.
4) Math. This is my favorite one!
Even if I am storing my strings in Unicode, and they are all 20-character values, I would still only be looking at 97 rows, each multiplied by:
- 6 * 8-byte bigints = 48 bytes
- 2 * 4-byte ints = 8 bytes
- 3 * 20-character varchars at 4 bytes per character (+1 byte of length overhead each) = 243 bytes
So ~300 bytes per entire record, total. At 97 records, we are looking at a grand total of ~28K. Practically nothing; my 300GB beastly servers could hold this result set 11 million times over. Applications on machines today should be able to chew up and spit out 28K in microseconds. Even at 10X, 100X, or 1,000X you are talking about relatively trivial amounts of space needed to hold and process this result set.
But let's take the President's ~25M followers:
25M * 300 bytes = ~7GB. Well now, that's a little more formidable! That is a lot of data for an application to request from a database: lots of memory on both sides, lots of data transfer, and a fair amount of IO.
If I were to exclude just the stuff I already know from the predicate, I would save ~101 bytes per record, and I am only sending ~4.6GB... so my 8GB application server went from dying to breathing, and I still know all of the answers I knew by going after SELECT *. A closer look would suggest that I also don't care who made the record or who modified it; excluding those saves another 82 bytes per record, and I am down to ~2.7GB. Still a lot of data for an app to swallow all at once, but less than half of what I was going to get with SELECT *. Any application, network, and disk will like you more if you are only trying to push 2.7GB of data over a pipe and onto disks vs. 7GB.
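The arithmetic above can be sketched in SQL itself. These are the back-of-the-envelope per-row estimates from this post, not exact Postgres on-disk storage math:

```sql
-- Rough result-set sizes using the estimates above.
SELECT
    6 * 8                  AS bigint_bytes,    -- six bigints: 48
    2 * 4                  AS int_bytes,       -- two ints: 8
    3 * (20 * 4 + 1)       AS varchar_bytes,   -- three varchar(20)s in UTF-8: 243
    97 * 300               AS dba_bytes,       -- ~28K for my 97 followers
    25000000::bigint * 300 AS president_bytes; -- ~7.5 billion bytes, ~7GB
```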
Those should be four compelling reasons not to make SELECT * calls in code. I make SELECT * calls all the time, but almost always it is (1) limited to a few rows and (2) only for discovery purposes.
There are, of course, other approaches to challenges like this, and this doesn't solve the entire problem. We have reduced the amount of data that is pushed over the network to the application, but have we done anything for the work the DB has to do? Not really. That's because typical RDBMSes store data row by row, and the rows are stored in data pages. If you need to read a column from a row, you read the entire page, and effectively the entire row. So while we have reduced the result set sent to the application, the database is still reading all 25M rows when we want to see who is following the President; we have not necessarily saved any work on the DB side.
This is where a new Postgres 9.2 feature known as index-only scans might provide some benefit. There are a lot of conditions that need to be in place for an index-only scan to fire, but in many cases, the index-only scan will reduce the amount of data Postgres has to read. Let's try it out!
Index Only Scans in Postgres 9.2
An index-only scan is a feature that allows a query result to be returned by accessing only a table's index. Specific conditions are required for one to be invoked, and because the planner is still cost-based, there are times when you think a coverage index will solve all of your problems, only to be disappointed when it never shows up in an EXPLAIN plan.
Let's say I am still looking at related users, but I want to see what kind of relationship they have. This is captured in the [relnshiptypeid] column in my sample data. With an index on (relateduserid, relnshiptypeid), can I return the result without touching the table?
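The index itself might look something like this. The name matches the one in the EXPLAIN output in this post, and since Postgres 9.2 has no INCLUDE clause, both columns are plain key columns:

```sql
-- Both the filter column and the selected column live in the index,
-- so the query can potentially be answered from the index alone.
CREATE INDEX jvusrrel_reluserid_typeid
    ON userrel (relateduserid, relnshiptypeid);
```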
For the DBA relationships, definitely!
EXPLAIN ANALYZE SELECT relnshiptypeid FROM userrel WHERE relateduserid = 50000007;
Index Only Scan using jvusrrel_reluserid_typeid on userrel (cost=0.00..344.02 rows=1267 width=4) (actual time=14.251..14.286 rows=97 loops=1)
Index Cond: (relateduserid = 50000007)
Heap Fetches: 0
What about if I am looking at the President's relationships?
EXPLAIN ANALYZE SELECT relnshiptypeid FROM userrel WHERE relateduserid = 50000001;
Index Only Scan using jvusrrel_reluserid_typeid on userrel (cost=0.00..1143527.66 rows=11933132 width=4) (actual time=33.110..468194.147 rows=27069389 loops=1)
Index Cond: (relateduserid = 50000001)
Heap Fetches: 0
It worked, but it was damn expensive! Probably not the way you want to retrieve this information, but if you had to, this would do it without hitting the userrel table.
For this particular case, our coverage index would likely work great for a lot of users, but over time there could be a reduction in its use because of updates. This has to do with the check Postgres makes against the visibility map. Does that mean I should drop the index? Probably not. If we consider the data set of users and followers, most users have very few followers relative to President Obama, and most users also have very few changes to their set of followers, so you may get a lot of benefit out of an index like this on Postgres 9.2. The way to know for sure is to try it: make sure you are not killing anything with the extra writes, and look at pg_stat_user_indexes to see how often your coverage indexes are being hit. If the number of hits is consistently going up, you are using the index and likely reducing IO against the physical table. If it was going up and then stopped, take a look at your VACUUM and ANALYZE settings. If it just never seems to get hit, drop it and look for different ways to make things faster.
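A quick way to watch that usage over time, using the standard pg_stat_user_indexes view (the index name is the one from this example):

```sql
-- idx_scan counts how many times the index has been used;
-- if it keeps climbing, the covering index is earning its keep.
SELECT relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE indexrelname = 'jvusrrel_reluserid_typeid';
```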
When the index-only scan is invoked, you will save on a couple of things:
- The amount of data read on the DB side will be limited to the width of your index, not the width of the table rows. If you do the math, this can be a huge savings.
- The query does not touch the table, which reduces concurrent operations against it. If you have some queries that MUST scan the table, but others that can be served by an index-only scan, performance improves and concurrency-based slowness is reduced.
This has more of an advantage on tables that are not updated frequently, because the query planner consults Postgres' visibility map. When a table is updated frequently, more pages get marked not-all-visible between vacuums, and the engine becomes less likely to choose an index-only scan.
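On a heavily updated table, a manual vacuum (or more aggressive autovacuum settings) is what restores those visibility bits; a minimal example:

```sql
-- VACUUM marks pages whose rows are visible to all transactions,
-- which is what lets index-only scans skip the heap entirely.
VACUUM ANALYZE userrel;
```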
As with all things in databases, the trade-offs are increased writes with more indexes, as well as more space consumed by the database. Testing is the key to finding the right balance.
There are certainly caveats associated with index-only scans. I highly recommend checking out the PG Wiki, as it has an awesome breakdown of how the feature works in Postgres, what they needed to do to make it work, and what it does (and does not) help with. As with every index and query plan, it depends, and it should be tested against a lot of different data distributions.
So, abandon SELECT * for these instant benefits:
- Easier to see what the code is getting from the DB
- Save lots of data transfer and crunching to/from your application
- DBAs will kill fewer kittens
Then add the appropriate Postgres 9.2 indexes for your critical queries in order to super-charge the results, and reduce concurrency problems on busy tables.