
It's day two at Portland's 5th Annual Open Source Bridge, and it has been a blast so far! I wanted to report in before I go into full panic mode preparing for my talk, PostgreSQL Replication: The Most Exciting Technology on Earth. I have spoken before, but I really work myself up beforehand. Speaking in front of smart people is thrilling and terrifying, so it's better for me not to dwell on that and instead talk about this awesome conference.

 

This is a high-tech conference, though the tracks have unusual names like Cooking, Chemistry and Culture. They cover a huge range of topics too, from specific technical deep dives to career and life development; Open Source Bridge has something for everyone. There is a giant table of Legos in the middle of the hacker lounge, and groups of smarties are talking shop, soldering and socializing all over the place. It is really a lot of fun. The PyLadies, MongoDB User Group, R User Group and PDX-OSGEO (OSGeo) groups are all representing, as are geeks from all sorts of cool companies and industries.

 

Conferences are a great way to meet people and spend some time away to think about new ideas. As a speaker, I find that both the preparation that goes into a good presentation and the follow-up questions and conversations afterwards ultimately make me a better DBA. Everyone has a unique problem they've faced, an important lesson they've learned, or an innovative solution they've applied, and sharing that information is what makes us all better. As an attendee, I find presentations to be a great introduction to a concept or technology (you can only go so deep in 45 minutes), and a way to think about the not-too-distant future instead of what you are working on today.

 

If you can't make it this year, no problem! Everything is videotaped and available after the conference. But if you can make it in person, you should come and work in the Hacker Lounge; the geek energy is strong and overwhelmingly positive. I believe anyone can attend the evening BoF sessions and hacker lounge sessions; you just need to register. Check out the Open Source Bridge Blog for more information.

 

[Photo: The Hacker Lounge Entrance]

[Photo: Registration and Courtyard]

 

I'll aim to report again after I finish my presentation. If you're here, don't hesitate to say hello!

With just over a week to go before I speak on Postgres at Open Source Bridge, it is a great time to start freaking out and think, all at once, of everything I need to get ready for over the next couple of months.

 

At Open Source Bridge, I'll be presenting PostgreSQL Replication: The Most Exciting Technology on Earth. If you are familiar with replication, and you stayed awake through the title, then you know exactly what kind of thrill ride to expect. If not, and you plan on going, grab a coffee beforehand and strap on your seat belt. I'm going to talk about the types of replication, the different options available for the Postgres engine, and what you need to consider when choosing a replication technology. I'll also cover some of the replication goodies in the 9.3 Beta. Still awake? You should go! It is Thursday, 6/20 at 3:45 PM PT.

 

PostgreSQL Replication - The Most Exciting Technology on Earth / Open Source Bridge: The conference for open source citi…

 

All kidding aside, I am looking forward to my first time speaking at Open Source Bridge. I love meeting new techies; there is no shortage of great conversation and smart people with great ideas. I look forward to meeting you if you are there next week. One session I know for sure I won't miss is my colleague Jesse Hallett's way more exciting Mod Your Android. I am sooo glad he is going before me, though only a few sessions before me, so I should be in full "I'm speaking shortly, time to freak out" mode. Especially since I'm planning to do a live demo.

 

Mod your Android / Open Source Bridge: The conference for open source citizens / June 18-21, 2013 / Portland, OR

 

As soon as I finish, I can focus on the next week and my first O'Reilly webcast! Intro to Raspberry Pi will be a presentation that covers the basics of the super-hot Raspberry Pi computer: why you want one, where to get one, what to do with it, and why your kid needs at least one. It is aimed at the beginner; afterwards, I hope you'll be inspired to order your own and make something amazing. Tons of people, young and old, have come up with amazing ideas for their Raspberry Pis. Check out the RPi Google Group if you want a daily dose of innovation. The webcast is Tuesday, June 25 at 10 AM PT, and will be available online afterwards. I hope you can make it!

 

Webcast: Intro to Raspberry Pi

 

So far, July is shaping up to be 3D printer month. On July 10th, I will be doing another O'Reilly webcast, 3D Printing for Everyone. A couple of years ago I built an open source 3D printer, and ever since then I have been telling everyone how this technology is going to change our future. It has incredible potential: it can help us reduce waste, and it gives everyone the ability to create and innovate. Remember how boring 2D printing went from horrible, constantly jamming, barely legible dot matrix printers with their hole-punched tractor feeds to inexpensive, fast, perfect-quality laser printers? 3D printing is still a little wonky in the same way now, but Star Trek replicators are truly on the horizon. The webcast aims to be a comprehensive introduction to 3D printing, where it is now, and where it is going. You probably can't have a 3D printer talk without going over the controversial stuff either, and I will be happy to offer my take.

 

At the end of the month, at OSCON 2013, I will be going a little deeper into 3D printing, and I will have my printer with me and (hopefully) printing. I'll cover the open source models and the RepRap concept of self-replicating printers (yes, exactly like the Terminator). I would recommend this session to anyone who is ready to get a 3D printer and wants to learn the pros and cons of building one from a kit vs. buying a completed one. If you are attending OSCON, I'd love to meet you; just let me know!

 

Webcast: 3D Printing for Everyone

My RepRap Printer - The eMaker Huxley: OSCON 2013 - O'Reilly Conferences, July 22 - 26, 2013, Portland, OR

I was working on an edge case that involved a lot of users, and a lot of people following some of those users: in the 20M range. It would be the equivalent of listing the Twitter followers of popular people, such as the President. Like most DBAs, I am right up there too.

 

After looking up the user IDs in a separate user table, the queries I was analyzing took a form something like this:

President: SELECT * FROM UserRel WHERE relatedUserID = 50000001 AND active = 1 AND state = 'approved' AND retirementdate = 0;

DBA: SELECT * FROM UserRel WHERE relatedUserID = 50000007 AND active = 1 AND state = 'approved' AND retirementdate = 0;


This seemed like a great opportunity to show why a SELECT * query can be very bad, and to explore index-only scans, a feature new to Postgres 9.2, and where we can leverage them for performance.

Let's say my table of user relationships contains the following columns:

Column            Data Type
relationshipid    bigint
userid            bigint
relateduserid     bigint
relnshiptypeid    integer
state             varchar(20)
creationdate      bigint
modificationdate  bigint
retirementdate    bigint
active            integer
createdby         varchar(100)
modifiedby        varchar(100)
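For reference, here is a sketch of the DDL that column list implies. This is my reconstruction for illustration; the key and NOT NULL choices are assumptions, not the actual production schema.

-- Reconstructed from the column list above; constraints are assumptions.
CREATE TABLE userrel (
    relationshipid    bigint PRIMARY KEY,
    userid            bigint NOT NULL,
    relateduserid     bigint NOT NULL,
    relnshiptypeid    integer,
    state             varchar(20),
    creationdate      bigint,
    modificationdate  bigint,
    retirementdate    bigint,
    active            integer,
    createdby         varchar(100),
    modifiedby        varchar(100)
);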

 

If I am working on queries on the application side, in my development environment, I might have 10 or 10,000 users in my test table, but I probably don't have 25 million. When I query my own followers with

 

SELECT * FROM UserRel WHERE relatedUserID = 50000007 AND active = 1 AND state = 'approved' AND retirementdate = 0;

 

I get 97 rows back. Not a big deal for the app to handle, and not a lot of work for the database, especially if I have an index on relatedUserId. In a table that also contains all of the President's followers, it is pretty likely that the query optimizer will choose the b-tree index over a scan of this table. It will return super fast, my app will process and render the results, and everything will perform great. That will also be true for the more popular DBAs who have broken 100 followers.
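(A simple b-tree index covers that case; this is a sketch, with an index name I made up for illustration:)

CREATE INDEX userrel_relateduserid_idx ON userrel (relateduserid);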

 

Four reasons for not using SELECT * in code

 

1) It makes you do the work twice.

When you see SELECT * FROM UserRel WHERE relatedUserID = 50000007 AND active = 1 AND state = 'approved' AND retirementdate = 0;

This means YOU are going to have to hunt down additional information, like which columns are actually included in *. You can't really rely on the assignments in the code, because they might not assign everything; you pretty much have to go to a database and look up the data structure you are working with. When the columns are specified, you know exactly what data the query is after, making it easy to identify what is wrong or missing. The extra 20 seconds you spend typing column names can save much more time, and prevent errors, when you are coding against multiple application and schema versions.


2) It may make the database do more work.

I am sometimes on the fence about this one because it is not something you would see in an EXPLAIN, but if you SELECT *, the DB has to go figure out what those columns are by consulting the system catalogs. Even when you specify columns, the parse operation hits the catalogs anyway to validate that the columns you are asking for exist, so it is somewhat of a trade-off. But let's try running this query a few times on a machine that doesn't have enough cache to hold the result:

SELECT * FROM userrel WHERE relateduserid = '50000001' AND state = 'approved' and retirementdate=0;






Time: 282079.551 ms

Time: 293938.359 ms

Time: 274396.129 ms

Time: 281434.231 ms

Time: 345524.364 ms

Avg: 295474.5 ms

SELECT relationshipid, userid, relnshiptypeid, creationdate, modificationdate FROM userrel WHERE relateduserid = '50000001' AND state = 'approved' and retirementdate=0;






Time: 301933.497 ms

Time: 295023.368 ms

Time: 269755.984 ms

Time: 293430.448 ms

Time: 273674.244 ms

Avg: 286763.5 ms

In this not-too-scientific example, it looks like you might get slightly faster times when specifying columns, but this is not the most important reason.


3) Why are you asking for something you already know? In the case of our query, we already know the relatedUserId, active, state, and retirementdate values, so why would we ask Postgres to return them again? If you already know them on the app side, you can save the application the extra round trip and assignments. Sometimes this is a savings of a few milliseconds, but if it is part of a method that gets called 100 times a minute, those saved milliseconds add up to saved hours very rapidly.


4) Math. This is my favorite one!

 

Even if I am storing my characters in Unicode, and every varchar holds a full 20-character value, I would still only be looking at 97 rows, each multiplied by:

    • 6 * 8-byte bigints = 48 bytes
    • 2 * 4-byte ints = 8 bytes
    • 3 * 20 * 4-byte varchars (+1 byte each for the varchar overhead) = 243 bytes

 

So, roughly 300 bytes per entire record. At 97 records, we are looking at a grand total of about 28K. Practically nothing; my beastly 300GB servers could handle this result set 11 million times over. Applications on today's machines should be able to chew up and spit out 28K in microseconds. Even at 10X, 100X, or 1,000X, you are talking about relatively trivial amounts of space needed to hold and process this result set.
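If you want to sanity-check that back-of-envelope math, Postgres can measure actual row sizes. This is just a quick check I would run against the test table, using the built-in pg_column_size function:

-- Approximate size in bytes of each matching row, plus the row count.
SELECT avg(pg_column_size(u.*)) AS avg_row_bytes, count(*) AS row_count
FROM userrel u
WHERE relateduserid = 50000007;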

 

But lets take the President's ~25M followers:

25M * 300 bytes = ~7GB. Well now, that's a little more formidable! That is a lot of data for an application to request from a database: lots of memory on both sides, lots of data transfer, and a fair amount of IO.

 

If I exclude just the columns I already know from the predicate, I save ~101 bytes per record, and I am only sending 4.6GB...so my 8GB application server went from dying to breathing, and I still know all of the answers I knew when asking for SELECT *. A closer look suggests I also don't care who created the record or who modified it; dropping those saves another 82 bytes per record, and I am down to 2.7GB. That is still a lot of data for an app to swallow all at once, but less than half of what I was going to get with SELECT *. Any application, network and disk will like you more if you are only trying to push 2.7GB of data over a pipe and onto disks vs. 7GB.

 

Those should be four compelling reasons not to make SELECT * calls in code. I make SELECT * calls all the time, but they are almost always (1) limited to a few rows and (2) only for discovery purposes.

 

There are, of course, other approaches to challenges like this, and this doesn't solve the entire problem. We have reduced the amount of data that is pushed over the network to the application, but have we done anything about the work the DB has to do? Not really. Typical RDBMSes store data row by row, and the rows are stored in data pages; if you need to read one column from a row, you read the entire page, and effectively the entire row. So while we have reduced the result set sent to the application, the database is still reading all 25M rows when we want to see who is following the President. We have not necessarily saved any work on the DB side.
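You can see this page-oriented storage in the catalog. As a quick illustration (the numbers will depend entirely on your data), relpages is the count of 8KB pages the table occupies, as of the last VACUUM or ANALYZE:

SELECT relname, relpages, reltuples
FROM pg_class
WHERE relname = 'userrel';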

 

This is where a new Postgres 9.2 feature known as index-only scans might provide some benefit. There are a lot of conditions that need to be in place for an index-only scan to fire, but in many cases, the index-only scan will reduce the amount of data Postgres has to read. Let's try it out!

 

Index Only Scans in Postgres 9.2

 

An index-only scan is a feature that allows a query result to be returned by accessing only a table's index. Specific conditions are required for one to be invoked, and because the planner is still cost-based, there are times when you think a covering index will solve all of your problems, only to be disappointed when it never shows up in an EXPLAIN plan.

 

Let's say I am still looking at related users, but I want to see what kind of relationship they have. This is captured in the [relnshiptypeid] column in my sample data. With an index on (relateduserid, relnshiptypeid), can I return the result without touching the table?
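(The index name below is the one that appears in the plans that follow; the DDL itself is my reconstruction of how it would have been created:)

CREATE INDEX jvusrrel_reluserid_typeid ON userrel (relateduserid, relnshiptypeid);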

For the DBA relationships, definitely!

EXPLAIN ANALYZE SELECT relnshiptypeid FROM userrel WHERE relateduserid = 50000007;

                                                                   QUERY PLAN                                                             

------------------------------------------------------------------------------------------------------------------------------------------------

Index Only Scan using jvusrrel_reluserid_typeid on userrel  (cost=0.00..344.02 rows=1267 width=4) (actual time=14.251..14.286 rows=97 loops=1)

   Index Cond: (relateduserid = 50000007)

   Heap Fetches: 0

 

What about if I am looking at the President's relationships?

 

  EXPLAIN ANALYZE SELECT relnshiptypeid FROM userrel WHERE relateduserid = 50000001;

                                                                          QUERY PLAN                                                                      

------------------------------------------------------------------------------------------------------------------------------------------------------------------

Index Only Scan using jvusrrel_reluserid_typeid on userrel  (cost=0.00..1143527.66 rows=11933132 width=4) (actual time=33.110..468194.147 rows=27069389 loops=1)

   Index Cond: (relateduserid = 50000001)

   Heap Fetches: 0

 

It worked, but it was damn expensive! Probably not the way you want to retrieve this information, but if you had to, this would do it without hitting the userrel table.

 

For this particular case, it is likely our covering index would work great for a lot of users, but over time its use would decline because of updates. This has to do with the check Postgres makes against the visibility map. Does that mean I should drop the index? Probably not. If we consider the data set of users and followers, most users have very few followers relative to President Obama, and most users also see very few changes to their set of followers, so you may get a lot of benefit out of an index like this on Postgres 9.2. The way to know for sure is to try it: make sure the extra writes are not killing anything, and look at pg_stat_user_indexes to see how often your covering indexes are being hit. If the number of hits is consistently going up, you are using the index and likely reducing IO against the physical table. If it was going up and then stopped, take a look at your VACUUM and ANALYZE settings. If it just never seems to get hit, drop it and look for different ways to make things faster.
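A quick way to watch that usage follows; pg_stat_user_indexes is the standard statistics view, and idx_scan climbs each time the index is used in a scan:

SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE indexrelname = 'jvusrrel_reluserid_typeid';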

 

When the index-only scan is invoked, you will save on a couple of things:

  • The amount of data read on the DB side will be limited to the width of your index, not the width of the table rows. If you do the math, this can be a huge savings.
  • The query does not touch the table, which reduces contention on that table. If you have queries that MUST scan the table, but others that can be satisfied by an index-only scan, moving the latter to the index improves performance and reduces concurrency-based slowness.

 

Index-only scans have more of an advantage on tables that are not updated frequently, because the query optimizer consults Postgres' visibility map. As rows change between vacuums, more bits in the map are turned off, and on a frequently updated table the engine will decide not to use an index-only scan.
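In 9.2 you can get a rough sense of this from pg_class, where relallvisible counts the heap pages currently marked all-visible in the visibility map (a quick check, not an exact predictor of the planner's choice):

SELECT relallvisible, relpages,
       round(100.0 * relallvisible / greatest(relpages, 1), 1) AS pct_all_visible
FROM pg_class
WHERE relname = 'userrel';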

 

As with all things in databases, the trade-offs are increased writes with more indexes, as well as more space consumed by the database. Testing is the key to finding the right balance for your workload.

 

There are certainly caveats associated with index-only scans. I highly recommend checking out the PG Wiki, which has an awesome breakdown of how the feature works in Postgres, what was needed to make it work, and what it does (and does not) help. As with every index and query plan, it depends, and it should be tested against a lot of different data distributions.

 

So, abandon SELECT * for these instant benefits:

  • Easier to see what the code is getting from the DB
  • Less data transfer and crunching to/from your application
  • DBAs will kill fewer kittens

 

Then add the appropriate Postgres 9.2 indexes for your critical queries in order to super-charge the results, and reduce concurrency problems on busy tables.

 

References:

What's new in PostgreSQL 9.2 - PostgreSQL wiki

Index-only scans - PostgreSQL wiki
