
I was working on an edge case that involved a lot of users, with a lot of people following some of those users, in the 20M range. It would be equivalent to listing the Twitter followers of popular people, such as:

[Image: prestweet.png (the President's Twitter profile and follower count)]

Like most DBAs, I am right up there too

[Image: edtweet.png (my own Twitter profile and follower count)]

 

Looking up my IDs in a separate user table, the form of the queries I was analyzing was something like this:

President: SELECT * FROM UserRel WHERE relatedUserID = 50000001 AND active = 1 AND state = 'approved' AND retirementdate = 0;

DBA: SELECT * FROM UserRel WHERE relatedUserID = 50000007 AND active = 1 AND state = 'approved' AND retirementdate = 0;


This seemed like a great opportunity to show why a SELECT * query can be very bad, as well as to explore a feature new to Postgres 9.2, index-only scans, and see where we can leverage it for performance.

Let's say my table of user-follow relationships, UserRel, contains the following columns:

Column            Data Type
----------------  ------------
relationshipid    bigint
userid            bigint
relateduserid     bigint
relnshiptypeid    integer
state             varchar(20)
creationdate      bigint
modificationdate  bigint
retirementdate    bigint
active            integer
createdby         varchar(100)
modifiedby        varchar(100)

 

If I am working on queries on the application side, in my development environment, it is possible that I have 10 or 10K users in my test table, but I probably don't have 25 million. When I go and query my own followers with

 

SELECT * FROM UserRel WHERE relatedUserID = 50000007 AND active = 1 AND state = 'approved' AND retirementdate = 0;

 

I get 97 rows back. Not a big deal for the app to handle, and not a lot of work on the database, especially if I have an index on relatedUserId. In a table that also contains all of the President's followers, it is pretty likely that the query optimizer will go for the btree over a scan of this table. It will return super fast, my app will process and render the data it pulled, and everything will perform great. That will also be true for the more popular DBAs that have broken 100 followers.
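For reference, the single-column index on relatedUserId assumed above might look something like this (the index name is hypothetical):

CREATE INDEX userrel_relateduserid_idx ON userrel (relateduserid);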

 

Four reasons for not using SELECT * in code

 

1) It makes you do the work twice.

When you see

SELECT * FROM UserRel WHERE relatedUserID = 50000007 AND active = 1 AND state = 'approved' AND retirementdate = 0;

YOU are going to have to go hunt down additional information, like: what are all the columns included in *? You can't really look at assignments, because they might not be assigning everything. You pretty much have to go to the database and find the data structure you are working with. When the columns are specified, you know exactly what data the query is after, making it easy for you to identify what is wrong or missing. The extra 20 seconds you take typing in column names can save you a lot more time and prevent errors when you are coding against multiple application and schema versions.
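Spelling the columns out makes the same query self-documenting; a sketch using the column list from the table above:

SELECT relationshipid, userid, relateduserid, relnshiptypeid, state,
       creationdate, modificationdate, retirementdate, active,
       createdby, modifiedby
FROM UserRel
WHERE relatedUserID = 50000007 AND active = 1 AND state = 'approved' AND retirementdate = 0;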


2) It makes the database do more work. Maybe.

I am sometimes on the fence on this one because it is something you would not see in an EXPLAIN, but if you SELECT *, the DB has to go figure out what those columns are by asking the system catalogs. I would say that even when specifying columns, the parse operation is going to have to hit the catalogs anyway, to validate that the columns you are asking for exist, so it is somewhat of a trade-off. But let's try running this query a few times on a machine that won't have enough cache to hold it:

SELECT * FROM userrel WHERE relateduserid = '50000001' AND state = 'approved' and retirementdate=0;






Time: 282079.551 ms

Time: 293938.359 ms

Time: 274396.129 ms

Time: 281434.231 ms

Time: 345524.364 ms

Avg: 295475 ms

SELECT relationshipid, userid, relnshiptypeid, creationdate, modificationdate FROM userrel WHERE relateduserid = '50000001' AND state = 'approved' and retirementdate=0;






Time: 301933.497 ms

Time: 295023.368 ms

Time: 269755.984 ms

Time: 293430.448 ms

Time: 273674.244 ms

Avg: 286763 ms

In this not-too-scientific example, it looks like you might get slightly faster times when specifying columns (about 3% here, well within run-to-run noise), but this is not the most important reason.
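As an aside, you can see exactly which columns * will expand to by asking the catalog yourself; a sketch against the sample table:

SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'userrel'
ORDER BY ordinal_position;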


3) Why are you asking for something you already know? In the case of our query, we already know the relatedUserId, active, and state values, so why would we ask Postgres to return them again? Since you already know them on the app side, you can save the application the extra transfer and assignments. Sometimes this is a savings of a few ms, but if it is part of a method that gets called 100 times a minute, those saved milliseconds add up to saved hours very rapidly.


4) Math. This is my favorite one!

 

Even if I am storing my characters in Unicode at 4 bytes apiece, and every varchar holds a full 20-character value, I would still only be looking at 97 rows, each roughly:

    • 6 bigints * 8 bytes = 48 bytes
    • 2 integers * 4 bytes = 8 bytes
    • 3 varchars * 20 characters * 4 bytes (+ a few bytes of varchar overhead) = ~244 bytes

 

So ~300 bytes per entire record, total. At 97 records, we are looking at a grand total of about 28K. Practically nothing; my 300GB beastly servers could handle this result set 11 million times. Applications on machines today should be able to chew up and spit out 28K in microseconds. Even at 10X, 100X, 1,000X you are talking about relatively trivial amounts of space needed to hold and process this result set.
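If you want to sanity-check that back-of-the-envelope estimate, Postgres can report the actual stored size of each row; a sketch against the sample table (exact numbers will differ, since Postgres pads and can compress values):

SELECT avg(pg_column_size(u.*)) AS avg_row_bytes
FROM userrel u
WHERE relateduserid = 50000007;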

 

But let's take the President's ~25M followers:

25M * 300 bytes = ~7GB. Well now, that's a little more formidable! That is a lot of data for an application to ask to receive from a database: lots of memory on both sides, lots of data transfer, and a fair amount of IO.

 

If I were to exclude just the stuff I already know from the predicate, I save ~101 bytes per record, and I am only sending 4.6GB... so my 8GB application server went from dying to breathing, and I still know all of the answers I would have gotten from SELECT *. Better still, a closer look suggests that I don't care who created the record or who modified it; dropping those columns saves another 82 bytes per record, and I am down to 2.7GB. That is still a lot of data for an app to swallow all at once, but less than half of what I was going to get with SELECT *. Any application, network, and disk will like you more if you are only trying to push 2.7GB of data over a pipe and onto disks vs. 7GB.

 

Those should be four compelling reasons not to make SELECT * calls in code. I make SELECT * calls all the time, but almost always it is (1) limited to a few rows and (2) only for discovery purposes.

 

There are, of course, other approaches to challenges like this, and this doesn't solve the entire problem. We have reduced the amount of data that is pushed over the network to the application, but have we done anything about the work the DB has to do? Not really. That's because typical RDBMSes store data row by row, and the rows are stored in data pages. If you need to read a column from a row, you read the entire page, and effectively the entire row. So while we have reduced the result set sent to the application, the database is still reading all 25M rows when we want to see who is following the President; we have not necessarily saved any work on the DB side.
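One way to see how much heap the database actually has to wade through is to ask for the table's on-disk size; a sketch, using the sample table name:

SELECT pg_size_pretty(pg_relation_size('userrel'));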

 

This is where a new Postgres 9.2 feature known as index-only scans might provide some benefit. There are a lot of conditions that need to be in place for an index-only scan to fire, but in many cases, the index-only scan will reduce the amount of data Postgres has to read. Let's try it out!

 

Index Only Scans in Postgres 9.2

 

An index-only scan is a feature that allows a query result to be returned by accessing only a table's index. Specific conditions are required for one to be invoked, and because the planner is still cost-based, there are times when you think a covering index will solve all of your problems, only to be disappointed by never seeing it show up in an EXPLAIN plan.

 

Let's say I am still looking at related users, but I want to see what kind of relationship they have. This is captured in the relnshiptypeid column in my sample data. With an index on (relateduserid, relnshiptypeid), can I return the result without touching the table?
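Such an index might be created like this (the name matches the one that appears in the EXPLAIN plans below):

CREATE INDEX jvusrrel_reluserid_typeid ON userrel (relateduserid, relnshiptypeid);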

For the DBA relationships, definitely!

EXPLAIN ANALYZE SELECT relnshiptypeid FROM userrel WHERE relateduserid = 50000007;

                                                                   QUERY PLAN                                                             

------------------------------------------------------------------------------------------------------------------------------------------------

Index Only Scan using jvusrrel_reluserid_typeid on userrel  (cost=0.00..344.02 rows=1267 width=4) (actual time=14.251..14.286 rows=97 loops=1)

   Index Cond: (relateduserid = 50000007)

   Heap Fetches: 0

 

What about if I am looking at the President's relationships?

 

  EXPLAIN ANALYZE SELECT relnshiptypeid FROM userrel WHERE relateduserid = 50000001;

                                                                          QUERY PLAN                                                                      

------------------------------------------------------------------------------------------------------------------------------------------------------------------

Index Only Scan using jvusrrel_reluserid_typeid on userrel  (cost=0.00..1143527.66 rows=11933132 width=4) (actual time=33.110..468194.147 rows=27069389 loops=1)

   Index Cond: (relateduserid = 50000001)

   Heap Fetches: 0

 

It worked, but it was damn expensive! Probably not the way you want to retrieve this information, but if you had to, this would do it without hitting the userrel table.

 

For this particular case, it is likely our covering index would work great for a lot of users, but over time there would definitely be a reduction in use because of updates. This has to do with the consideration Postgres makes against the visibility map. Does that mean I should drop the index? Probably not. If we consider the data set of users and followers, it is likely that most users have very few followers relative to President Obama, and most users also have very few changes to their set of followers, so you may get a lot of benefit out of an index like this on Postgres 9.2.

The way to know for sure is to try it: make sure you are not killing anything with the extra writes, and look at pg_stat_user_indexes to see how often your covering indexes are being hit (see the query below). If the number of hits is consistently going up, then you are using the index and likely reducing IO against the physical table. If it was going up and then stopped, take a look at your VACUUM and ANALYZE settings. If it just never seems to be getting hit, drop it and look for different ways to make things faster.
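A minimal sketch of that check, assuming the sample table used above:

SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'userrel';

If idx_scan for the covering index keeps climbing, it is earning its keep.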

 

When the index-only scan is invoked, you will save on a couple of things:

  • The amount of data read on the DB side will be limited to the width of your index, not the width of the table rows. If you do the math, this can be a huge savings.
  • The query does not touch the table itself, which reduces concurrent operations against it. If you have queries that MUST scan the table, but others that can be satisfied by an index-only scan, moving the latter onto the index improves performance and can reduce concurrency-based slowness.

 

This has more of an advantage on tables that are not updated frequently, because the query optimizer consults Postgres' visibility map. As updates dirty pages between vacuums, more bits in the map get cleared, heap fetches become necessary, and the engine will decide not to use an index-only scan.
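You can get a rough sense of how much of a table is currently all-visible, and therefore friendly to index-only scans, from the catalog; a sketch (relallvisible is maintained by VACUUM):

SELECT relname, relpages, relallvisible
FROM pg_class
WHERE relname = 'userrel';

The closer relallvisible is to relpages, the better the odds the planner picks the index-only scan.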

 

As with all things in databases, the trade-offs are increased writes with more indexes, as well as more space consumed by the database. Testing is the key to finding the right balance for your workload.

 

There are certainly caveats associated with index-only scans. I highly recommend checking out the PG Wiki, as they have an awesome breakdown of how the feature works in Postgres, what they needed to do to make it work, and where it does (and does not) help. As with every index and query plan, it depends, and it should be tested against a lot of different data distributions.

 

So, abandon SELECT * for these instant benefits:

  • Easier to see what the code is getting from the DB
  • Save lots of data transfer and crunching to/from your application
  • DBAs will kill fewer kittens

 

Then add the appropriate Postgres 9.2 indexes for your critical queries in order to super-charge the results, and reduce concurrency problems on busy tables.

 

References:

What's new in PostgreSQL 9.2 - PostgreSQL wiki

Index-only scans - PostgreSQL wiki

Congratulations to the Producteev team for releasing their first app into the Jive Apps Market! The Producteev app is the first step in making one of the recent acquisitions in our portfolio available within the Jive platform.

[Image: producteev.png (the Producteev app in Jive)]

The app takes full advantage of the !App capability so that you can quickly "app mention" tasks in your Producteev workspaces, like, say, Create announcement Blog for Producteev. You can also quickly make your discussions actionable by creating a task via the "App Action".

[Image: producteev1.png]

 

Producteev is available for free in the Jive Apps Market. Install it today and make your teams more productive!

 

Enjoy!

Welcome 2013!

Posted by mark.weitzel Jan 3, 2013

HAPPY NEW YEAR!!!

 

Thank you to everyone who helped make 2012 incredible! Looking back, we accomplished a lot and had a ton of fun. So as we ring in 2013, let's take a quick look back at some of my favorites from 2012....

 

Continuously Hopped for 90 Minutes


 

We re-launched the Jive Developer Community. With the new "beer" theme, not only does it reflect the spirit of Jive (after all, each release is given the code name of a fine brew), but it's also "less filling and tastes great!" On top of that, we brought the Jive Community and our Developer Community together in one place. Now our app partners, developers, and Jive Champions can all work together and collaborate more easily. Many of our app partners, such as INXPO, have already created public groups where you can give them feedback, request features, and learn more about their apps.

 

 

 

 

First Time Ever...


 

Speaking of app partners, Lucidchart is now pre-installed with every Jive instance. This is the first time a third-party app shows up by default in new instances of Jive. Pretty sweet! You can check out all the details in an earlier blog post, Idea ~ Draw ~ Collaborate! Lucidchart Now a Pre-installed App.

 

 

 

 

!Apps

We also introduced Jive's "App Experiences". Based on OpenSocial's embedded experiences, !apps enable an entirely new way to extend the platform and interact with applications. With one click, you can quickly work with applications in content, in the stream, and pretty much wherever you work in Jive. We used this same mechanism to radically simplify installing applications as well. Now, if you click on an artifact from an application that you don't have installed, the Jive Apps Market will come to you!


V3 Rocks!

If you have not used the V3 API yet, you are totally missing out. This is a complete overhaul of our API and offers almost complete platform coverage. Our goal was that if the Jive UI did it, you should be able to do it through the API. You can now create users and all content types. You've got access to tags, metadata, and much more. The V3 API supercharges your integration efforts.

 

Get Your JiveOn!

We want to get the world's best and most innovative social collaboration platform into the hands of the world's best and most innovative startups. That's why we created Jive for Startups. By creating and offering an integration, for example an app, startups will be able to use Jive for free.

 

What Happens in Vegas... gets delivered in the Jive Apps Market

Let's not forget Jive World 12! We've been telling you how easy it is to build powerful apps, and at #jw12, we proved it! Thanks to Aron Racho, who led the development team, and Ryan Vanderzanden, who led the designers, we pulled off the first-ever Jive World "hack-a-thon," where we built the Jive Albums app, start to finish, in less than 30 hours. Where else but Vegas and Jive World can you choose an application and have it built and delivered to you in less than three days? Certainly not in Disney World.

 

Keeping the Spirit Alive

Keeping the spirit of Jive World alive, we held the first Jive Hack-a-Thon in Research Triangle Park. A small group of us from Fidelity and Deutsche Bank got together to get our hands dirty building Jive Apps. Now, everyone knows that building apps requires energy, and ours came courtesy of Fullsteam. We quickly ramped up a number of developers on Jive Apps. It was a good hands-on experience that we'd like to do more of in 2013. If you are interested in hosting a hack-a-thon, please send me a direct message and let me know.

 

 

2013: Better. Stronger. Faster


2013 promises to be another fast-paced and exciting year. We've got a head start on making more apps--and their source code--available to you. We are working to add an "open source" flavor to our Developer Community through the [Archived] Jive Open INnovation (JOIN) group. We'll continue our commitment to open technologies by working with the OpenSocial Foundation, jQuery, and others. With the upgrade to 6.0, we'll be adding more fun "missions" to our gamification module that will help you get involved in the developer community. Of course, we've got some goodies coming soon that will make leveraging the power of Jive and integrating with the platform even easier. Stay tuned, you're gonna love it.

 

We've got a lot to do, so let's get coding!
