I have a Server 2008 R2 Enterprise system acting as an RDSH server in a vSphere 4.1 Advanced cluster environment. This server experiences intermittent lock-ups during business hours. I am posting in the RDS (TS) forum because I believe the problem
relates specifically to it being an RDSH server. The lock-ups occur at irregular intervals, about 30 times over the last six months, and always while end users are working on the server.
More info about the environment:
- 2008 R2 domain, all DCs running R2, at the 2008 domain and forest functional level
- HP ProLiant DL360 G7 hardware running ESXi 4.1 in vSphere 4.1 Advanced cluster
- HP StorageWorks P2000 G3 SAN utilizing 10K and 15K SAS 6 Gbps DP drives over Brocade FC switches
- Almost entirely HP printers installed on server, with a couple of others. Most printers are HP LaserJet 2420s using PCL5 and PCL6 drivers.
- Clients are mixed between XP, Vista, 7 and thin clients based on CE, HP ThinPro (Linux), and WES 2009. All desktops fully patched.
- The server runs Office 2007, Chrome (multi-user install), Firefox, IE9, a proprietary LoB application, AVG Antivirus 2012, ShadowProtect, Adobe Reader, Flash, Java, Sonicwall Terminal Services Agent, and uses Desktop Experience to provide a full Aero
environment where possible.
- When the system locks up, all network communication and VMware Tools heartbeats cease. On the vSphere console, we are able to issue Ctrl + Alt + Del at the login prompt, which causes the "Press Ctrl + Alt + Del" message to go away, as if it is about
to prompt for username and password, but it never does.
- Device redirection is disabled
- The server has four vCPUs and 12GB of RAM assigned to it; it has had between 20 and 50 concurrent users at the time of the crashes
Looking at Event Viewer, there is no single, consistent set of events in any log that correlates with every crash. However, there are several sets of events that can be tied to individual crashes.
Set1:
WinLogon 6005
The winlogon notification subscriber <Sens> is taking long time to handle the notification event (Disconnect).
Service Control Manager ID 7011
A timeout (120000 milliseconds) was reached while waiting for a transaction response from the SessionEnv service.
DCOM 10010
The server {AAC1009F-AB33-48F9-9A21-7F5B88426A2E} did not register with DCOM within the required timeout
Set2:
Event ID 1000, Interactive Services Detection
A device or program has requested attention. Device or application: C:\Windows\System32\spoolsv.exe. Message title: \\CSR|[HOSTNAME of PRINT SERVER & DC]\{94AFF4B1-B79E-4BA3-B27C-179216BCC082},LocalsplOnly Document Properties
2:00:11
Event ID 7036, Service Control Manager
The Windows Error Reporting Service service entered the running state.
Numerous Event ID 602.
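One way to check more systematically which events actually precede the hangs is to export the logs and count, for each source/ID pair, how many hangs had that event in the minutes beforehand. Below is a rough Python sketch of that idea, not something we have run against this server. It assumes the events have been exported to CSV with hypothetical columns named time, source, and event_id, and that the hang times are written down by hand. The sample rows and timestamps are made up for illustration; only the source names and IDs come from the events listed above.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timedelta

# Made-up sample data. Column names (time, source, event_id) are an
# assumption about how the event log export would be laid out.
SAMPLE_CSV = """time,source,event_id
2012-01-10 10:01:00,Winlogon,6005
2012-01-10 10:03:00,Service Control Manager,7011
2012-01-10 13:00:00,Service Control Manager,7036
2012-01-17 14:28:00,Service Control Manager,7011
2012-01-17 14:29:30,DCOM,10010
"""

# Hypothetical hang times, as they might be noted from the vSphere console.
SAMPLE_HANGS = [datetime(2012, 1, 10, 10, 5), datetime(2012, 1, 17, 14, 30)]


def load_events(csv_text):
    """Parse exported events into dicts with a datetime, source, and int ID."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "time": datetime.strptime(row["time"], "%Y-%m-%d %H:%M:%S"),
            "source": row["source"],
            "event_id": int(row["event_id"]),
        }
        for row in reader
    ]


def events_before_hangs(events, hang_times, window=timedelta(minutes=10)):
    """For each (source, event_id), count the hangs that had at least one
    such event within `window` before the hang."""
    counts = Counter()
    for hang in hang_times:
        seen = {
            (e["source"], e["event_id"])
            for e in events
            if hang - window <= e["time"] <= hang
        }
        counts.update(seen)  # each pair counted once per hang
    return counts


if __name__ == "__main__":
    counts = events_before_hangs(load_events(SAMPLE_CSV), SAMPLE_HANGS)
    for (source, event_id), n in counts.most_common():
        print(f"{source} {event_id}: seen before {n} of {len(SAMPLE_HANGS)} hangs")
```

A pair that shows up before most of the roughly 30 hangs would be worth chasing; one that shows up before only a few is more likely background noise. The 10-minute window is an arbitrary starting point.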
_______________________________________________________________
After much research, we've been led to believe the issue might be caused by problems with HP printers. This HP thread is relevant:
http://h30499.www3.hp.com/t5/Print-Servers-Network-Storage/64-Bit-HP-BiDi-Channel-Components-Installer/td-p/1085299
However, we were able to resolve the BiDi component issues. We still get Event ID 602, and I noticed that some of the printers had the wrong driver installed. I also read an anecdotal report that reverting to PCL 5 can help:
http://forums.citrix.com/thread.jspa?threadID=261933&start=15&tstart=0
I have not yet made this driver change. I did apply the hotfix found in KB2457866.
VMware Tools, Windows, and the HP hardware have been updated recently. All software except the custom LoB app is in use at our other clients on 2008 R2 servers in vSphere, and we do not have this problem anywhere else. This environment has nine
other 2008 R2 Standard and Enterprise systems running on the vSphere cluster without issues. The printers are served from a domain controller that does not experience issues.
While I intend to apply some printer driver updates and revert to PCL5, I am posting this in the hopes someone can give us another direction. I am not 100% convinced this is caused by printer issues, since I cannot always correlate printing to the
crashes. I have reviewed several TechNet, Citrix, and HP forum threads describing similar symptoms, but none match exactly. At this point the print drivers and custom LoB software are the primary suspects, but I'm open to
more lines of troubleshooting.
Edit: another symptom to note is that when the hang occurs, vSphere shows CPU, RAM, and disk utilization spike, then drop to nothing as the system locks up and VMware Tools stops reporting. The host seems to have plenty of resources, and
I'm not inclined to believe the server is under-spec'ed. The same user applications ran in 2003 terminal services on a single-vCPU, 4GB system with weaker disks.