HTTP 503s and URL Reservations (IIS 7.0+)

I saw a problem today where a customer was unexpectedly getting 503 errors when attempting to access a directory named /Reports in an application hosted in IIS. The server running the application also had Reporting Services 2008 R2 installed.

I verified that the application pool was running and that there were no errors written to the application or system logs around the time of the failing requests. 503s are almost always due to a crashing or stopped application pool in my experience, so this finding surprised me.

The only clues I could find were lines like the following in httperr.log:


2013-01-03 14:24:01 192.168.1.10 42291 10.8.7.6 80 HTTP/1.1 GET /Reports/MyReport.aspx 503 - N/A -
2013-01-03 14:24:02 192.168.1.10 42292 10.8.7.6 80 HTTP/1.1 GET /Reports/MyReport.aspx 503 - N/A -
2013-01-03 14:30:52 192.168.1.10 33052 10.8.7.6 80 HTTP/1.1 GET /Reports/MyReport.aspx 503 - N/A -
2013-01-03 14:30:54 192.168.1.10 33056 10.8.7.6 80 HTTP/1.1 GET /Reports/MyReport.aspx 503 - N/A -
2013-01-03 14:32:34 192.168.1.10 33179 10.8.7.6 80 HTTP/1.1 GET /Reports/MyReport.aspx 503 - N/A -

One thing that stood out was the lack of information in these 503 entries. 503 errors in httperr.log usually have additional fields populated, like this:


2009-07-26 13:29:29 127.0.0.1 49165 127.0.0.1 80 HTTP/1.1 GET /page.asp 503 1 AppOffline DefaultAppPool

The missing substatus code and other fields appear to be related to HTTP.sys failing the request before it ever reaches a running application. Which application HTTP.sys routes a request to is determined by URL reservations, which you can list with the following command:


netsh http show urlacl

The problematic URL reservation was the following one that Reporting Services creates as part of its default configuration:


Reserved URL : http://+:80/Reports/
User: NT AUTHORITY\NETWORK SERVICE
Listen: Yes
Delegate: No
SDDL: D:(A;;GX;;;NS)

URL reservations are also logged to the system log (source Microsoft-Windows-HttpEvent) when they are created, for example:


Log Name: System
Source: Microsoft-Windows-HttpEvent
Date: 1/1/2013 12:08:32 AM
Event ID: 15007
Task Category: None
Level: Information
Keywords: Classic
User: N/A
Computer: MySSRS
Description:
Reservation for namespace identified by URL prefix http://+:80/Reports/ was successfully added.

Web servers that want to receive HTTP requests through HTTP.sys create URL reservations to tell it which requests to hand to which application.
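For illustration, a reservation for a hypothetical application listening on port 8080 could be created like this (the URL prefix and account are placeholders, not anything SSRS or IIS requires):

netsh http add urlacl url=http://+:8080/MyApp/ user="NT AUTHORITY\NETWORK SERVICE"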

This reservation was matching requests for /Reports that the user intended to send to the client’s application hosted in IIS. Since the Reporting Services service was stopped, HTTP.sys was trying to hand the request to Reporting Services but failing and returning 503. If it had been running, I assume we’d have been getting 404s instead since no page named MyReport.aspx exists in SSRS.

To fix the problem, I ran the SSRS configuration wizard to change the Report Manager URL, after which HTTP.sys routed requests for the customer’s application as expected. Of course, normally you wouldn’t want to change that without planning, but in this case the client wasn’t even using SSRS.
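If SSRS truly isn’t needed on the box, another option, which I didn’t test here, would be to delete the reservation directly; note that SSRS may recreate it the next time it is configured or started:

netsh http delete urlacl url=http://+:80/Reports/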

A similar problem is discussed here.

I’m still trying to work out how HTTP.sys determines what application the request is for. I understand that a service SID can be part of the reservation, but there doesn’t appear to be one for the reservations that SSRS creates. If anyone knows the answer, please comment.


Using SQL Server as a time zone converter

In my current job, I frequently have to deal with situations where some of my colleagues are in one time zone, the customer is in another, the customer’s servers are in a third, and some logs on those servers are in yet another, and I need to look up equivalent times across all of them. This should be simple, but to make sure I’m not confusing myself and making silly mistakes, I like to work it out in writing. Today I found that this can be done easily in T-SQL, for example:

--Convert a DATETIMEOFFSET from UTC to China Standard Time
DECLARE @mydto AS DATETIMEOFFSET = '2012-12-25 19:54 -00:00';
SELECT SWITCHOFFSET(@mydto, '+08:00');

Which returns:

2012-12-26 03:54:00.0000000 +08:00
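If the source value is a plain DATETIME with no offset attached (a log timestamp, say), TODATETIMEOFFSET can tag it with its zone’s offset before converting; the offsets below are just examples:

--Treat a naive log timestamp as UTC-05:00, then view it in China Standard Time
DECLARE @logTime AS DATETIME = '2012-12-25 14:54';
SELECT SWITCHOFFSET(TODATETIMEOFFSET(@logTime, '-05:00'), '+08:00');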


Pages/sec Myths on Windows

I’ve seen a few questions recently about the meaning of the Windows Perfmon counter Memory:Pages/sec and the range of optimal values for it. This counter is simply the sum of Memory:Pages Input/sec (pages read from disk into memory) and Memory:Pages Output/sec (pages in memory written to disk). Neither counter necessarily refers to the paging file (a misconception I’ve heard more than once); they count pages read from or written to any file.
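If you want to watch all three counters together, PowerShell’s Get-Counter can sample them side by side; the interval and sample count below are arbitrary:

Get-Counter -Counter '\Memory\Pages/sec','\Memory\Pages Input/sec','\Memory\Pages Output/sec' -SampleInterval 2 -MaxSamples 5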

Memory throughput on modern systems is typically very high — about 31 gigabytes/second per CPU socket on a properly configured Dell R710, for example. Windows uses 4 KB pages (except when large pages are enabled), so on an R710 this translates into a theoretical maximum of about 8 million pages/sec — a value much higher than I’ve ever seen this counter display.

In practice, at least with spinning disks, you’ll hit a disk I/O bottleneck long before reaching anywhere near that number, because pages/sec is limited by what the I/O subsystem can deliver. So I usually don’t see much point in worrying about the absolute value of this counter unless you’re seeing some indications of an I/O bottleneck or another problem:

  • High I/O response times in Resource Monitor
  • High values of LogicalDisk:Avg Disk sec/Read
  • High values of LogicalDisk:Avg Disk sec/Write
  • High I/O response times shown in other tools, such as SQL Server’s sys.dm_io_virtual_file_stats
  • Throughput on your I/O subsystem approaching a theoretical limit
  • High CPU usage correlated with a large variation in pages/sec

The various target values I hear people propose for this counter seem to come from a few places:

1. Microsoft’s own documentation, for example this KB article.
2. Performance analysis tools such as PAL.

As you might have guessed from the age of the KB article, some of the target values you hear about are based on the performance characteristics of older disks. ClintH, one of the developers of PAL, explains further here:

    \Memory\Pages/sec and related counters like Page Inputs/sec measure hard page faults to and from the disk. x86 and x64 Windows computers use a default 4 KB page size by default. If you have a threshold of 20, then you are saying that the disk is not able to handle 80 KB per second (4 KB x 20). Most 7200 RPM disk drives can handle about 10 MB per second. 4 KB x 1000 is 4 MB, so a 7200 RPM disk drive should be able to easily handle 4 MB per second which is the equivalent of 1000 hard page faults per second. This is why I start PAL off with 1000 and why everyone else should update their thresholds to be in line with modern hardware.

Other proposed target values are so low (5-10 in the MS KB article above) that I have to wonder if they’re based on performance characteristics of disks made in the 80s and early 90s.

There is a maximum value for this counter on any system past which performance will degrade, but that value is going to vary greatly depending on the details of the I/O subsystem (RAID level, type and number of disks, amount of write cache, etc.) and the workload (the mix of read, write, sequential, and random I/O, etc.).

Thank you for taking the time to read, and I welcome any feedback.


Power Management on Dell R710

Just a quick post here. On a Dell R710, you can run the following command to query the BIOS power management settings from the OS:

omreport chassis pwrmanagement config=profile

You can run the following to set the profile to maximum performance:

omconfig chassis pwrmanagement config=profile profile=maxperformance

I thought I would note these here, as it took me a while to find them in the Dell manuals.

If you want to confirm what frequency the CPUs are actually running at on Windows, try CPU-Z.
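WMI also exposes the rated and (nominally) current clock speeds; on some hardware CurrentClockSpeed just echoes the rated speed, so treat this as a rough check and trust CPU-Z if they disagree:

Get-WmiObject Win32_Processor | Select-Object Name, CurrentClockSpeed, MaxClockSpeed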


tcptraceroute

TIL you can use tcptraceroute to run a traceroute using TCP packets aimed at a fixed destination port of your choice. This helps avoid misunderstandings where a traceroute fails due to filtering or some other reason unrelated to the actual problem being investigated. For example, today a customer reported being unable to traceroute to live.sagepay.com:

[root@swordfish ~]# traceroute live.sagepay.com
traceroute to live.sagepay.com (195.170.169.9), 30 hops max, 60 byte packets
1 50.56.184.2 (50.56.184.2) 8.080 ms * *
2 50.56.129.216 (50.56.129.216) 12.503 ms 12.469 ms 12.423 ms
3 corea.dfw1.rackspace.net (74.205.108.6) 12.398 ms 12.361 ms coreb.dfw1.rackspace.net (74.205.108.22) 12.315 ms
4 edge4.dfw1.rackspace.net (74.205.108.47) 12.276 ms edge4.dfw1.rackspace.net (74.205.108.49) 12.209 ms edge4.dfw1.rackspace.net (74.205.108.47) 12.208 ms
5 border2.10ge13-2.rackmanaged-6.ext1a.dal.pnap.net (63.251.32.49) 12.157 ms 12.130 ms 12.097 ms
6 core2.pc1-bbnet1.ext1a.dal.pnap.net (216.52.191.40) 12.083 ms 10.085 ms core2.pc2-bbnet2.ext1a.dal.pnap.net (216.52.191.105) 10.096 ms
7 TenGigabitEthernet1-2.ar3.DAL2.gblx.net (208.51.41.57) 11.718 ms 11.690 ms 11.625 ms
8 204.245.36.118 (204.245.36.118) 213.967 ms 213.931 ms 213.900 ms
9 ten1-0-0-t6-cr2.router.uk.clara.net (195.157.6.201) 213.836 ms 213.814 ms 213.655 ms
10 ten0-0-0-t40-cr1.router.uk.clara.net (195.8.68.117) 213.585 ms 213.550 ms 213.504 ms
11 ten0-0-0-t40-cr2.router.uk.clara.net (195.8.86.162) 203.993 ms 203.888 ms 202.202 ms
12 ten0-0-0-tcl2-cr1.router.uk.clara.net (195.157.6.22) 202.165 ms 202.139 ms 122.299 ms
13 g8-2-tcl2-ar1.router.uk.clara.net (195.157.6.226) 125.386 ms 125.353 ms 125.291 ms
14 80.168.85.62 (80.168.85.62) 120.958 ms 120.924 ms 120.856 ms
15 * * *
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *

That looks bad, but it’s misleading. By running a tcptraceroute, I was able to confirm that the packets were in fact reaching the destination:

[root@swordfish ~]# tcptraceroute live.sagepay.com 443
Selected device eth0, address 198.101.245.48, port 36488 for outgoing packets
Tracing the path to live.sagepay.com (195.170.169.9) on TCP port 443 (https), 30 hops max
1 50.56.184.2 1.511 ms 0.454 ms 0.484 ms
2 50.56.129.216 3.928 ms 1.793 ms 1.751 ms
3 coreb.dfw1.rackspace.net (74.205.108.22) 7.559 ms 2.249 ms 2.250 ms
4 edge4.dfw1.rackspace.net (74.205.108.49) 4.405 ms 1.462 ms 1.471 ms
5 border2.10ge13-2.rackmanaged-6.ext1a.dal.pnap.net (63.251.32.49) 3.328 ms 2.391 ms 2.436 ms
6 core2.pc2-bbnet2.ext1a.dal.pnap.net (216.52.191.105) 3.298 ms 2.436 ms 2.507 ms
7 TenGigabitEthernet1-2.ar3.DAL2.gblx.net (208.51.41.57) 3.559 ms 2.753 ms 2.832 ms
8 204.245.36.118 185.912 ms 111.975 ms 111.885 ms
9 ten1-0-0-t6-cr2.router.uk.clara.net (195.157.6.201) 116.007 ms 113.718 ms 113.763 ms
10 ten0-0-0-t40-cr1.router.uk.clara.net (195.8.68.117) 127.334 ms 113.715 ms 113.580 ms
11 ten0-0-0-t40-cr2.router.uk.clara.net (195.8.86.162) 203.072 ms 211.562 ms 220.632 ms
12 ten0-0-0-tcl2-cr1.router.uk.clara.net (195.157.6.22) 115.052 ms 114.169 ms 114.200 ms
13 g8-2-tcl2-ar1.router.uk.clara.net (195.157.6.226) 230.179 ms 125.152 ms 258.499 ms
14 80.168.85.62 113.323 ms 112.160 ms 112.549 ms
15 195.170.169.5 116.257 ms 113.681 ms 113.495 ms
16 195.170.169.9 [open] 115.830 ms 112.617 ms 112.077 ms

So either the problem was not that live.sagepay.com was totally unreachable from this customer’s server, or it had been resolved by the time I looked into it. You could also have confirmed this with a packet sniffer, or by verifying that the connection was getting past the SYN_SENT state.
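For example, if a connection attempt to the destination shows ESTABLISHED rather than sitting in SYN_SENT, the handshake is completing (the IP below is live.sagepay.com’s, from the trace above):

netstat -ant | grep 195.170.169.9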

This tool will probably be most informative in cases where traffic genuinely isn’t reaching its destination, but network filters prevent an ordinary traceroute from reporting at which hop the problem is occurring.

tcptraceroute is available here. There are also packages for RHEL and CentOS.


Quick and easy way to see your user agent

I picked this up from the Coursera SaaS course. On any system with netcat installed, run the following from a terminal:

nc -l 9999

where '9999' is any port on your device that isn’t in use. Then browse to http://localhost:9999 and look at the output in the terminal window.

For example, I’ll see:

$ nc -l 9999
GET / HTTP/1.1
Host: localhost:9999
Connection: keep-alive
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
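Two caveats from my own use rather than from the course: traditional netcat variants (the Debian flavour, for instance) want the listen port flagged explicitly, as in nc -l -p 9999, and you can pipe a minimal response into netcat so the browser doesn’t sit waiting forever:

printf 'HTTP/1.1 200 OK\r\n\r\nhello\r\n' | nc -l 9999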


Classic ASP Performance and “Enable ASP server-side script debugging”

If you see performance issues on a classic ASP site that aren’t explained by resource usage, it’s a good idea to confirm that the “Enable ASP server-side script debugging” option is not enabled. This option forces all ASP requests for the application to run on a single thread, which can cause major blocking problems and terrible performance. This is mentioned in the following articles on Microsoft’s site:

PRB: IIS May Hang If You Enable ASP Server-Side Script Debugging

Debugging ASP Applications in IIS

These articles refer to IIS 4.0-6.0, but I can confirm that the same setting kills performance on IIS 7 and 7.5, so I assume, though I haven’t confirmed it by direct observation, that the same single-threaded behavior occurs there.
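On IIS 7+, the setting lives in the system.webServer/asp configuration section as the appAllowDebugging attribute, so something along these lines should show and clear it (“Default Web Site” is a placeholder for your own site or application path):

%windir%\system32\inetsrv\appcmd list config "Default Web Site" -section:system.webServer/asp

%windir%\system32\inetsrv\appcmd set config "Default Web Site" -section:system.webServer/asp /appAllowDebugging:false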


NTFS File Creation Performance Testing on 2008 R2 (part 2)

In my previous post, I mentioned that I’d be doing similar performance testing against a share, to see whether antivirus software impairs file creation performance in the same way when the files are created over the network. Here are the results:

[Chart: seconds to create 10,000 files over a share, by client/server antivirus configuration]

The column labels are a little busy, but for this test, antivirus software with real-time scanning enabled on the client degraded performance the most, while antivirus software on the server didn’t make much of a difference. You might see a different pattern in other scenarios; real-time scanning on the server side could plausibly matter more for read-heavy workloads.


NTFS File Creation Performance Testing on 2008 R2

I had a chance to do some performance testing recently. I was mainly interested in finding out how much of an impact antivirus software has on operations involving large numbers of files.

I ran the following PowerShell script to create 10,000 2 KB files:

$rootDir = 'X:\perftest'   # assumes this directory already exists
$baseFile = Join-Path $rootDir 'base.txt'
$desiredFileSize = 2*1KB
$fileCount = 10000

# Grow the base file to ~2 KB by repeatedly appending its current length
'start' | Out-File -FilePath $baseFile
while ($(Get-ChildItem -Path $baseFile).Length -lt $desiredFileSize)
{
    $(Get-ChildItem -Path $baseFile).Length | Out-File $baseFile -Append
}

# Time the creation of File1..File9999 (10,000 files in total, counting base.txt)
$startTime = [DateTime]::Now
for ($i = 1; $i -lt $fileCount; $i++)
{
    Copy-Item $baseFile "$rootDir\File$i.txt"
}
[DateTime]::Now - $startTime | Select-Object TotalSeconds | Write-Host
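As an aside, Measure-Command would be a tidier way to take the timing; it runs a script block and returns a TimeSpan directly (same hypothetical paths and variables as above):

Measure-Command { for ($i = 1; $i -lt $fileCount; $i++) { Copy-Item $baseFile "$rootDir\File$i.txt" } } | Select-Object TotalSeconds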

I ran this with and without antivirus enabled, and with and without 8.3 name creation enabled. The antivirus software was disabled by disabling all of its services and rebooting to ensure its kernel module was unloaded. 8.3 name creation was disabled by running "fsutil 8dot3name set x: 1", rebooting, and confirming with "fsutil 8dot3name query x:".

Note that the last-access timestamp is disabled by default on Windows 2008 R2. Disabling it has traditionally been a way to improve I/O throughput on Windows, but since it is now off by default, it was not part of the tests here.

The test files were deleted and the server rebooted before each test to minimise the impact of filesystem caching. Each test was run 3 times, with the files deleted and the server rebooted between runs, and each figure below is the average, in seconds, of the 3 runs.

The results were as follows:

Test configuration             Seconds
10k files, AV off, 8.3 on      21.2
10k files, AV off, 8.3 off     23.1
10k files, AV on, 8.3 off      26.3
10k files, AV on, 8.3 on       26.8
100k files, AV off, 8.3 off    215
100k files, AV on, 8.3 off     273.4
100k files, AV off, 8.3 on     275
100k files, AV on, 8.3 on      316.9

[Chart: NTFS file creation times by test configuration]

We can get a better idea of where exactly execution time is spent by monitoring function call durations in Process Monitor. This is pretty simple:

1) Get Process Monitor from http://live.sysinternals.com or \\live.sysinternals.com
2) Turn on the Duration column
3) Run Process Monitor and set an appropriate filter
4) Save the trace to a CSV, then import it into Excel or another tool of your choice (see the PowerShell sketch below)
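For instance, summing the Duration column by operation takes a couple of lines of PowerShell; the column names below match ProcMon’s defaults, but check your own export:

Import-Csv procmon.csv | Group-Object Operation | Select-Object Name, @{n='TotalSeconds';e={($_.Group | Measure-Object -Property Duration -Sum).Sum}}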

I found that the vast majority of the time during the test (85%) was spent inside the CreateFile function. Breaking that down by test, we get:

Test configuration            Seconds
40k files, AV off, 8.3 on     63.3
40k files, AV off, 8.3 off    64.8
40k files, AV on, 8.3 on      74.2
40k files, AV on, 8.3 off     89.9

Interestingly, calls to CloseFile also took twice as long when antivirus was enabled.

This really narrows down how antivirus software is affecting execution time, and it makes it clear that AV software (and maybe 8.3 name creation) could have a significant impact on a real-world workload like uploading a large number of small image files to a website or inserting many FILESTREAM documents into SQL Server.

In contrast, 8.3 name creation and real-time AV scanning (at least with the AV product I tested) have no discernible effect on operations involving large amounts of data but relatively few files.

I thought these results were somewhat interesting. For the tests with 10k files, enabling antivirus software with real-time scanning degraded performance, but enabling 8.3 name creation didn’t consistently degrade performance. For the tests with 100k files, enabling either 8.3 name creation or an antivirus program with real-time scanning degraded performance significantly, and enabling both degraded it the most. This aligns pretty well with my professional experience: I’ve seen plenty of I/O performance problems involving operations on large numbers of files that cleared up after disabling antivirus software, but I’ve never seen a demonstrable improvement from tweaking the 8.3 name creation settings.

For more information on 8.3 name creation and why it won’t consistently affect performance, see Helge Klein’s post here. I think there might be some edge cases, such as an MFT that doesn’t fit in memory or hasn’t been read into memory yet, where it is more likely to make a difference.

In the next post I’ll look at whether this pattern of results is reproducible when accessing files over a share.
