You may have heard the phrase “Pay no attention to the man behind the curtain” which is a line from the movie The Wizard Of Oz. This is the message many infrastructure teams like to sell to DBAs – “Pay no attention to the hardware behind the curtain”. To a certain extent, this is what we like to hear true since the DBAs have enough to focus on in their daily tasks. Blind faith can lead to problems, since the performance of SQL Server can be negatively impacted by poorly selected and configured components behind the scenes. To combat this, a DBA should know how each component works together to form the configuration of the infrastructure behind SQL Server, as well as the details and little minutiae of the server configuration.
Think of your environment and see if you could answer the following questions (many of which I will ask as part of an assessment or onsite engagement):
- Are the servers virtualized? If they are, are any resources over committed?
- How many SQL Servers are sharing the same physical chassis?
- How many network hops does it take to get from the SQL Server to the application servers?
- What uplink speeds are on the switches between your SQL Server and the application servers?
- How much cache is in the SAN?
- What tier storage do your data and log files reside on?
If you can answer some or all, you’re ahead of most DBAs out there. Unfortunately, in my experience working directly with customers over the years, the DBAs are in the position of not knowing any of this stuff. But where do you start to find these answers, let alone understand the infrastructure concepts behind them?
Develop a relationship with each team that supports the infrastructure underneath SQL Server, such as those responsible for storage. Have each team provide you a high level view of your environment and ask questions. Over time, your knowledge will grow and the design decisions you make based on that knowledge will start to have a positive effect in your environment.
These relationships are important both short and long term. Short term, it will come in handy when an issue occurs and you need to reach out to the other teams for help while troubleshooting. Long term, they can be leveraged when you are designing the future state architecture for a new mission critical application. By then, there will be a healthy working relationship between the teams, yielding a much better design.
Here is an example of why understanding the plumbing behind your SQL Server matters and how it can affect something as simple as your daily backups. Many of the physical servers with SQL Server installed are often rack mounted. There can be 6 – 8 servers in one chassis. Those servers share common components like network and power. Why does that matter?
Now think about your backup strategy. It’s in our nature to typically schedule all of the full database backups for a single instance at the same time each night. A typical database backup is limited by several factors, but for this example, assume an average of 75MB/sec can be written to the destination, which in this case is a network-based backup solution. If there are 8 servers on a single chassis and you are backing up across the network at the same time for each physical server (all have SQL Server installed), the chassis and network segments need to be able to handle a total of 75 * 8, or 600MB/sec.
A 1 GB network yields a maximum of 122MB/sec ((1,000,000 Gbps/8 bits)/1024 = 122MB/sec), so you do not have enough throughput to get the job done. Achieving 600 MB/sec means you need a 10 GB network.to achieve the performance you require to support the all of the backup processes running concurrently. These bandwidth requirements do not include any other network needs for the applications or other traffic on the network at the same time, so you may also need even more network throughput. I bet you never thought of your backup performance from that angle!
Since many of you now run some or all of your SQL Server deployments in virtual machines, a similar concept happens. Your virtualization admins could deploy a ton of VMs on the same hardware, creating a situation where one or more of the SQL Server VMs you are responsible for is competing with any number of other VMs for memory, network, and I/O since they share the resources of the hypervisor. This means that you will have to not only understand what is going on from a performance standpoint in the VM, but also see what is happening at the hypervisor level, too. That includes knowing how many VMs are running.
I hope this post inspires you to now want to look behind the curtain and see what is back there. Ask the questions from above as well as others you come up with above. The answers will allow you to be better informed and make smarter decisions in every aspect of your role as a DBA.
Happy Monday, everyone!
SQL Server 2012 is right around the corner, and it’s a release that is right up my alley in the mission critical space with features like availability groups (AGs) as well as Windows Server Core support. I’m looking forward to its RTM and subsequent release into the world.
However, I keep getting one question over and over from various people so I thought I’d address it to a wider audience. The question is a simple one: Will AGs replace or do away with the need for a clustered instance of SQL Server? If you want the short answer, it’s a resounding NO, but read on if you want to know why I’m saying that. For the record, if you were in doubt, both are branded under the designation marketing gave to the mission critical features of SQL Server 2012: AlwaysOn (with no space). Make no mistake about it: they are two different features with a single marketing designation.
First and foremost, let me say this: because this is going to be a quick, down-and-dirty post, there will be no pictures. I’m breaking one of my rules here, because this topic could use one (or more). Hopefully without the pictures all of this will still make sense.
Availability groups and clustered instances of SQL Server (also known as FCIs) have one thing in common: Windows failover clustering. AGs use Windows failover clustering (WSFC) as an underlying platform for things like the listener and to provide quorum (like the optional Witness configured on an instance did for database mirroring [DBM]). It makes sense – use a standard, well-proven process like WSFC and not reinvent the wheel. The cool thing about availability groups is that you do not have to cluster the instances; you can use standalone instances on servers that are clustered (those servers are also known as nodes).
[Edie and side note - This is the one aspect that seems to confuse folks with AGs: requiring WSFC but not needing an FCI. I hope you now understand why.]
Here’s the main difference: FCIs protect an entire instance of SQL Server, while an AG – just like log shipping or DBM – only protects at the database level. AGs are much better than DBM in terms of what they offer (example: readable replicas), but you still need to worry about things like logins at the instance level, SQL Server Agent jobs, or any other object that is not in the database. With SQL Server 2012 you could use contained databases to solve the login issue, but that won’t be an option for everyone. You need to think about both levels of protection, and there are a ton of other factors which would influence how a final solution is built. I do not want you to think this is going to be a black and white decision.
The bottom line: it depends on what your goals are for availability and/or disaster recovery. One nice thing about FCIs and AGs is that they play well together, unlike FCIs and DBM. That latter combination, depending on what you wanted to do, could work well together but it was never 100% ideal. There are certainly some caveats to the FCI/AG combination, but that’s a whole separate topic I’m not going to even touch here. I don’t want to give you the wrong idea – combining isn’t bad nor impossible, but like anything else, you do need to plan properly to have them work optimally.
I hope this clarifies things. If not, let me know in the comments and I’ll address any questions there or in another blog post.
Everything old is new again – well, not really. Many of the core concepts I teach or talk about be it in a session or onsite as part of knowledge transfer on a consulting engagement haven’t changed over the years. Implementations may change, but ultimately, the things that drive them do not. If you care about such arcane concepts like uptime, availability, and mission critical, you understand that downtime is bad. Very bad. Yet I still see and talk to quite a few customers who like to cause themselves unnecessary pain. I mean, it’s perfectly legal here in the USA to own a gun and shoot yourself in the foot, but why would you? It makes no sense.
Move Forward, Not Backward
A big part of staying mission critical is patching your servers – be it an application, the OS, hardware drivers, BIOS … you get the idea. You want to stay supported and up to date. The farther behind you are, the tougher it is to catch up or make the next major upgrade (say, SQL Server 2008 to SQL Server 2012) because you are not at the right minimum level. But that’s not the focus here – the focus is on those more regular patches that get installed such as security updates, browser updates, and so on. Those are the thorns in the proverbial rose bush.
Most companies have outage windows to apply updates and perform other maintenance to servers regardless of OS and other choices. That makes sense and is definitely needed – 100% uptime is largely a fallacy. (Sorry to burst your bubble if you really believed it was possible.) But whether it’s a security patch or a “routine”, more infrequent service pack, care must be taken.
Summertime … And The Updates Are Easy
George Gershwin is probably rolling in his grave, but I parody with the deepest respect. Microsoft has made it painfully easy (emphasis on painful) to update your servers. If you’re connected to the interwebz, you can set the handy, dandy Windows Update feature to download updates automatically. Even better – you can even have Windows automatically install them for you, too. What’s next? A tiled interface that makes Windows harder to use and makes your Server look like Windows Phone? Nah, something that silly could never never happen in a million years. Oh wait …
All kidding and sarcasm aside, look at your production servers: are they set to automatically update? If so, you are NOT mission critical. You may think you are, but you’re not. You’ll never convince me otherwise. Why?
- Mission critical is a mindset. Part of that mindset is caring about things like SLAs, RTOs, and RPOs. Allowing servers to spontaneously download and update themselves – possibly requiring a reboot or worse causing one without you even asking – is 180 degrees from that mindset. You might as well just remove any redundancy and leave your chances to fate, who is a cruel mistress. You NEVER want to see this should you have to log into a production server:
- Updating for the sake of updating is never the right approach. File under “just because you can, doesn’t mean you should”. Key tenets of mission critical are predictability, consistency, and stability. Servers, especially those participating in a clustered configuration (Windows and/or SQL Server), should be evaluated before applying an update to see if they actually need them. Some do NOT apply, and others may be specific to a configuration (such as a cluster). All patches should be tested prior to going on a production box, and some should be given more weight than others. For example, a fix that touches, say, TCP/IP would be one I would test the heck out of since servers don’t rely on networking or anything.
- To add to #2, I’m going to stick on the consistency point for a moment: servers, especially ones considered mission critical, should all be configured the same. You don’t want varying patch levels and such. That spells trouble with a capital T. (Thank you, Meredith Wilson … I’m on a musical roll tonight!)
- If you are going to do anything automatic, just download the updates but for heaven’s sake, don’t let the server automatically apply them. That would be a huge step forward.
The Bottom Line
Look, I realize many of you struggle on a daily basis with your servers and doing this one thing – automatic updates – seemingly makes your lives easier by taking a big task off your proverbial plate. In reality, it may be making it harder. And I also know that in smaller shops especially, you don’t always have the time or (wo)manpower to evaluate each patch/update. I get it. There’s a happy medium somewhere. Don’t turn short term gain into long term pain. There’s no easy or right answer here per se. You have to update. Do what’s best for you, but realize you may blow anything relating to uptime and reliability in the process.
I’m in the process of testing upgrades to SQL Server 2012. In the process, I’m trying different paths. The current one I’m in the middle of is a clustered instance which will go from SQL Server 2005 SP4 –> SQL Server 2008 SP1 –> SQL Server 2008 R2 SP1 CU3 –> SQL Server 2012. The SQL Server 2008 and 2008 R2 installations were slipstreamed.
However, when I went to go upgrade to SQL Server 2008 R2 SP1 CU3 using my slipstreamed media, the upgrade kept failing. When I went to go look at the log files, one log file (SqlBrowser_Cpu32_1) had the culprit:
=== Verbose logging started: 2/13/2012 22:33:26 Build type: SHIP UNICODE 5.00.7601.00 Calling process: C:\Program Files\Microsoft SQL Server\100\Setup Bootstrap\SQLServer2008R2\x64\setup100.exe ===
MSI (c) (A8:4C) [22:33:26:965]: Resetting cached policy values
MSI (c) (A8:4C) [22:33:26:965]: Machine policy value ‘Debug’ is 0
MSI (c) (A8:4C) [22:33:26:965]: ******* RunEngine:
******* Product: C:\Slipstream 2K8 R2 SP1 CU3\1033_ENU_LP\x64\setup\sqlbrowser.msi
…
MSI (s) (34:F0) [22:33:27:012]: MainEngineThread is returning 1639
Info 1639. Invalid command line argument. Consult the Windows Installer SDK for detailed command line help.
;”C:\SLIPSTREAM
…
It looks like the code, for whatever reason, barfed with the spaces only at this point but was OK with it earlier. I renamed the directory to get rid of the spaces and it worked. I entered a bug on Connect here, so go and vote for it.
Hello all. I was putting together some demos for the private class I am delivering this week and I decided to slipstream SQL Server 2008 R2 with SQL Server 2008 R2 SP1 and SQL Server 2008 R2 SP1 CU3. This is fully supported and the process is straightforward. My installation worked fine, and my clustered instance got installed seemingly with no issues. However, I observed the following:
- In the detailed log file (detail.txt), it references version 10.50.2789 – which is what I expect.
- SQL Server thinks it’s at version 10.50.2789 - again, all good.
- Summary.txt does not match – it has version 10.51.2789
See the pic below:

Completed SQL Server 2008 R2 + SP1 + SP1 CU3 install
Has anyone else seen this? I’m not really woried since the engine is reporting the right number and detail.txt is fine. It just doesn’t make sense that one place would have a slightly different version number.
Also, PCUSOURCE and CUSOURCE didn’t show up on the Ready to Install dialog like I’ve seen. Again, no big deal but just found it interesting.
Many people, including myself, spend our own time answering questions on things like the #sqlhelp hash tag on Twitter and on various forums (TechNet/MSDN, SQL Server Central, etc.). This even happens during talks in Q&A sessions. Most people are nice and gracious, but every now and then, you have someone who just keeps chipping away at the stone to hopefully get the answer they really want to hear (even though it’s not possible). 10 observations:
- One thing I see quite often is the lashback when a correct answer is provided but yet the arguments still continue. Here’s the deal: some technologies – clusters are a good example – work the way they work. I didn’t design the feature. For example, the requirement for shared storage for a clustered instance of SQL Server up through SQL Server 2008 R2 (this changes a little in 2012, but only in certain scenarios). You can’t get a Windows failover cluster or a SQL Server clustered instance to work any other way potentially without either unsupported (by MS) hacks or not clustering at all. Don’t hate us if you may have not thought out all aspects of your design, have run into an issue, and don’t like what we have to say. We’re really not trying to egg you on or irk you – we’re just giving facts.
- Don’t expect us to design your massively complex solution for you for free based on 140 characters or a blog post in a free forum. That’s consulting and what most of us do for our day jobs. We’ve got bills to pay. We can point you in the right direction, but hopefully you’re going to base your design on more than a few line response. Those of us who answer questions love to give back, but realize that there’s always going to be some caveats to what you get for free in a smaller response.
- We can’t make you experts in 140 characters or a series of replies. Hopefully you’re not expecting that.
- Help us help you. This can’t be said enough. The more and accurate info we have, the better. I go through this with my clients all the time – it’s hard to make decisions about tomorrow when you don’t know what is going on today, what happened last week, or even last year. You need historical data to be able to make informed decisions.
- You’re probably wasting your time if you ask something like “I need more disk performancebut I’ve got no budget. Will 5 SATA disks work?” You pretty much have already answered your own question. You can do things right, you can do things fast, or you can do things cheap. Those three things generally don’t all work together. Sometimes you can do them right and fast, but never cheap. Fast and cheap are often lumped together, but often leads to never being right. I think you get the point. There’s a tradeoff somewhere.
- Capacity is not just size (memory, storage, etc.). There’s a whole performance aspect you need to consider. That’s why we ask questions like “How many I/Ops do you need?”
- Your workload matters. Know what it is and be able to hopefully articulate that. What advice works for one implementation may be totally wrong for another.
- RT(F)M. Before asking us a question, do some simple homework first. A simple BinGoogle search may yield your answer or point you to a BOL topic to allow you to ask a more directed question. That’s where we’ll provide you with (hopefully) some value.
- Do NOT post any confidential info about your company, app, or data. This may include things like schema, object names, etc. Common sense, but is it really? Find a way to express your question while still keeping sensitive things under wraps. Easier said than done, of course. This is why I put NDAs in place with my customers when we work together. It allows me to do work and protects both them and me.
- I subscribe to the motto that there are no stupid questions, and I truly believe that. Heck, we all started somewhere. It’s one reason I give back to the community. I didn’t get here on sheer will alone - I’ve had help along the way. But really look at #8 above – many questions can be answered that way. What is not easy, though, is trusting the information source it may come from. I’ve seen some sketchy advice put out there (no, I won’t say where), and it’s a free Internet. People can post what they want based on their experiences – just doesn’t make it technically sound advice. OK, I lied – one example which involves shrinking. We all know how shrinking is not a normal best practice. Don’t believe me? Just read some of Paul Randal’s thoughts. Point is, that first linked post gave OK technical instructions, but not necessarily the most sound advice. Learn to differentiate the two. In the case of that original link, I bet the DBA smartly sized the DB for longer term growth. We’ll never know.
Sorry if this sounds cranky, but I see this stuff over and over. We really do want to help you. Just don’t shoot the messenger if you don’t like the message.
Can you believe it’s February already? 2012 is flying by! I’m thankful that I’m crazy busy, and the partnership with Ben is going well (knock on wood). I’m really looking forward to the training class in March and psyched to have another Training Day at SQLBits X (have you registered yet? No? Go do it!). It’s been a fun travel year already. With one trip to Seattle and California under my belt already, I’ve got quite a few trips between now and April (including across the Atlantic to SQLBits in March) with a few more pending.At this rate, I’ll hit 50,000 miles in the air by mid-year. I know, first world problems – cry your crocodile tears for me.
Quite frankly, since late August when my friend Mike passed away (see Life Is Fragile and Fitting Tribute to a Hero for more info), things have been a blur. Last week (January 25 to be exact) would have been his 40th birthday. No wonder it’s been a somewhat “meh” couple of weeks for me. Considering I just had my 40th at the end of November, it’s all still very visceral for me. Little things remind me of him and probably always will. Rest in peace, my friend.
I also blogged a little about my substandard PASS scores. And no, they were not in the 2s, but certainly poor. To be perfectly honest, since I briefly looked at them back in December, I can’t bring myself to really look at them again. I remember some of the comments. Some were spot on. Looking back at the possible reasons for failure (at least in my eyes), like I said in that blog post – any problems fall squarely on my shoulders. I can’t blame anyone else, but I could certainly use some excuses such as:
- I used a build that was not very old and I had just gotten clearance a few days before to use – and right before the holidays so I didn’t have much time to work with it. I knew the risk, and I bit the dirt on that one.
- In retrospect, the death of my friend really impacted me more than I could have ever imagined. Even with killer material and solid demos, I probably still would have had a sub-par PASS by any stretch of the imagination. Again, not an excuse, but looking back I can see where that really weighed on me.
- I was a bit far ahead of the curve. Since it’s not a stretch to say that SQL Server 2012 will RTM soon now that we’re at the Release Candidate and full adoption for many customers is months, if not a year or more away, doing all SQL Server 2012 material that early was a big risk.
So I went big, took risks, and failed. That’s OK – it’s a good learning exercise and I’d probably do it again. Why? Well, I didn’t get where I am today without taking some big leaps of faith in myself (and sometimes those around me) along with some risks along the way. Sometimes you win, sometimes you lose. You can learn from both. If you never experience failure or disappointment (real disappointment, not the “Mommy didn’t buy me the stuffed animal at the toy store” kind), I’m not sure you can ever understand what success looks like. At least that’s how I feel on the matter.
Despite the fact I have done this for years, my PASS scores showed I may need to brush up on presenting, so I’m actually looking to improve my presentation skills. I’ve done a few things already (including getting a few books) that I hope will show some positive results. Last week at the SQL Bare Metal training in Redmond, I got up and did an impromptu 20 minute presentation that felt really good. No deck, just demos; and no preparation. That was a tough audience – all MCMs, MVPs, and other really smart folks who know SQL well. Easy to bomb there. I didn’t. Maybe it is working. You can let me know when you see me present at SQL Saturday in Mountain View or at SQLBits, and even the training class in March.
To be honest, even if I never score in the top 5 or 10 at a conference, I get more satisfaction from people that come up to me – be it 1 day or 2 years later – saying they took something from a talk I did and it helped them. Those are measurable results numbers won’t ever show.
The journey matters, not just the destination. A score is just a number. Having a positive impact on someone is much better than a score in the long term.