Monday, July 25, 2016

vCloud Director Logging

I was recently asked how to go about configuring the Log Insight Agent with VMware vCloud Director and thought that I would take the time to document it here for anyone else who is interested.

Logging in vCD is normally handled by log4j and configured in $VCLOUD_HOME/etc/ as described in the official KB located here. You should use either log4j or the Log Insight Agent, but not both, or you will get duplicate events.

Log4j Configuration
First a quick overview of the log4j configuration.
1. Open $VCLOUD_HOME/etc/
2. Append "vcloud.system.syslog" to the rootLogger, and don't forget the comma before it.
3. At the bottom of the file, append the 6 lines outlined in the KB, making sure to change the target FQDN to your own.
4. Unfortunately, with vCD 5.x you also have to restart the vmware-vcd service for the changes to take effect. Hint: if you don't want to restart the services and take an outage, you can continue reading and use the Log Insight Agent instead :)
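
The rootLogger change in step 2 is easy to get wrong by hand (the missing comma in particular), so here is a minimal Python sketch of that one edit. It works on the text of the properties file rather than on the file itself. The function name and the sample rootLogger value used to try it out are made up for illustration and are not taken from the KB; the 6 appender lines from step 3 still need to come from the KB.

```python
def add_root_logger_appender(properties_text, appender="vcloud.system.syslog"):
    """Return the properties text with `appender` added to log4j.rootLogger.

    Lines other than log4j.rootLogger are passed through untouched, and the
    appender is not added a second time if it is already listed.
    """
    result = []
    for line in properties_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "log4j.rootLogger":
            # The value is a comma-separated list: level first, then appenders.
            names = [item.strip() for item in value.split(",")]
            if appender not in names:
                line = f"{key}={value.rstrip()}, {appender}"
        result.append(line)
    return "\n".join(result) + "\n"


if __name__ == "__main__":
    # Hypothetical rootLogger line, only here to show the effect.
    sample = "log4j.rootLogger=ERROR, example.appender\n"
    print(add_root_logger_appender(sample), end="")
```

Running the function twice gives the same result as running it once, which makes it reasonably safe to use in a script that touches every cell.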

Log Insight Agent
vCloud Director supports RHEL and CentOS so you only need to worry about the RPM install of the Log Insight Agent. First though, we need to do some prep work on the Log Insight Server.

1. Install the vCD Content Pack - On the Log Insight Server that you will be pointing your LI Agent at you will need to have the vCD Content Pack installed so the Agent Group is available. This is easily done via the Marketplace

2. Create your Agent Group - From the Administration window select Agents and then highlight the vCloud Director Cell Servers pre-defined Agent Group.
Next scroll to the bottom of the page and select Copy Template

3. Next you will need to define a filter that limits this collection to only vCD Cells. My test example here is very basic and limiting to hosts with a certain hostname prefix.
You can see in the bottom section of the agent group the actual files that will be collected by the agent.
By default the agent only collects info level logs, but you can easily switch that to debug level logs if you desire. Feel free to check out my very basic sizing calculator on Github if you are curious about the impact of the additional logs. For now, just hit Save Agent Group to continue.

4. Now you are ready for the actual agent installation! You will need to copy the RPM to the /tmp directory of each vCD cell. The LI Agent needs to be installed and configured on every vCD Server.
Note: At some point after this step you will need to decide when to remove the log4j configuration and when to enable the Agent. I would personally recommend disabling log4j before installing the agent. You won't lose any events in the short term, since the LI Agent will go through all the log files on the server and forward them on.

5. Install the agent via RPM

6. If you downloaded the agent from the Log Insight Server it is supposed to forward to, you don't need to modify the liagent.ini file, but if you downloaded it from elsewhere or from another Log Insight Server you will need to update the target hostname.
If you want to be secure you can enable SSL in your /etc/liagent.ini file.
Don't forget that you'll need certificates for SSL, so follow the full official documentation available here.
At this point you should see that your agents are alive and sending data to your Log Insight Server.
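
Since the agent has to be configured on every cell, the [server] section of liagent.ini is a natural thing to script. Below is a rough Python sketch, not an official tool: the function name and hostnames are made up, and the port numbers (9543 for cfapi with SSL, 9000 without) are my understanding of the agent defaults, so check them against the official documentation for your version.

```python
import configparser
import io


def set_liagent_server(ini_text, hostname, ssl=True):
    """Return liagent.ini text with the [server] section pointed at `hostname`.

    Assumptions: cfapi protocol, port 9543 with SSL and 9000 without.
    Note that configparser drops comment lines, so the commented-out
    defaults that ship in the file will not survive this rewrite.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(ini_text)
    if not parser.has_section("server"):
        parser.add_section("server")
    parser["server"]["hostname"] = hostname
    parser["server"]["proto"] = "cfapi"
    parser["server"]["ssl"] = "yes" if ssl else "no"
    parser["server"]["port"] = "9543" if ssl else "9000"
    buffer = io.StringIO()
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical starting config and target hostname.
    print(set_liagent_server("[server]\nhostname=LOGINSIGHT\n", "loginsight.example.com"))
```

The function only returns text; writing it back to the file and restarting the agent service is left to whatever deployment method you already use.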


Friday, July 8, 2016

Early Boot Windows Debugging - Part 2 - Kernel Debugging over Serial

This post is a continuation of Part 1; I think I shall call it "Help, my ntbtlog.txt isn't being written to disk and I'm flying blind"

Ok, now I need more data because I'm not getting anywhere. Fortunately Windows still has the option to log kernel debugging over serial, a feature I wasn't aware existed until today. That brings up the big question: how do I make that work on a VM and a physical device without a serial port?

First you need to enable virtual printers in VMware Workstation under Edit > Preferences. Without this enabled Workstation can't attach to named pipes.
Next we need to add a virtual serial port to our VM and tell it to output to a named pipe.
Next accept or change the named pipe (only replace the part "com_1" if you change it) and set it so that "This end is the server" and "The other end is an application".  This means that your VM is the server and you are going to attach an application to the named pipe.
With that out of the way you need to install the Windows Debugging Tools which are included in the Windows SDK. Link for Windows 10 is here. After installing the debugging toolset we need to launch a new kernel debug session.
Go to File > Kernel Debug in WinDbg.
Next select the COM tab and fill it out with the settings below, replacing the name of the port with your named pipe.
Hit Ok and you should see your debugger start and say it's "Waiting to reconnect..."
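
If you would rather skip the dialog, WinDbg can also be started against a named pipe from the command line with its -k com:pipe option. Here is a small Python sketch that builds that argument list; the function name is made up, and the default pipe name com_1 is just the one used in this walkthrough.

```python
# Local named pipes live under \\.\pipe\ (written with escaped backslashes).
PIPE_PREFIX = "\\\\.\\pipe\\"


def windbg_pipe_args(pipe_name="com_1"):
    """Build the argument list to start WinDbg as a kernel debugger on a pipe.

    resets=0 and reconnect are the options generally suggested for virtual
    machine pipes, so the debugger keeps waiting across VM reboots.
    """
    connection = f"com:pipe,port={PIPE_PREFIX}{pipe_name},resets=0,reconnect"
    return ["windbg", "-k", connection]


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run() to actually launch it.
    print(" ".join(windbg_pipe_args()))
```

This assumes windbg is on your PATH; otherwise replace the first element with the full path from your Debugging Tools install.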
Even if you boot the VM at this point you won't get any information. First we need to boot to the Windows Repair wizard, go to Troubleshoot > Command Prompt and enable debugging using bcdedit.

bcdedit /bootdebug {bootmgr} on (Windows Boot Manager)
bcdedit /bootdebug on (boot loader)
bcdedit /debug on (OS Kernel debugger)

At this point you can now reboot. In theory this should be all that you need for debugging but I've noticed that the information is still lacking.

Instead have it boot explicitly to debug mode

Now your debug output should have much more valuable information, this time pointing to "IOINIT: Built-in driver \Driver\sacdrv failed to initialize with status - 0xc0000037"

Congratulations, you can now see what is actually going on in your OS and where the root of the issue is with much more clarity.

Early Boot Windows Debugging - Part 1 - Basics

I have a Windows Server 2012 VM that will not boot past the Windows splash screen but throws a BSOD with the error "SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (NETIO.SYS)". It's been a long while since I worked on troubleshooting Windows (I primarily use CentOS) but here's what I've found. I don't have the solution yet but I'm recording some tidbits that I found so I will have them later.

First a bit of preamble:

1. Advanced Boot Options - When you select "Enable Boot Logging" this is supposed to write a log file named ntbtlog.txt. However, in this particular case that never happens. This is presumably because it is before the appropriate driver is loaded to write log files. However with 2012 this is conjecture since the latest Microsoft documentation that I can find applies to Server 2000. Regardless of reason, it isn't captured in this instance.
2. This VM was originally running on ESXi but I have exported an OVF to my local VMware Workstation for my troubleshooting.
3. In the below operations I will be referencing "d:\" which is actually the c:\ of the server. It is referenced from the rescue command prompt as d:\ on my system.

Step 1: Boot to the command prompt from the troubleshooting menu in the Automatic Repair wizard
Step 2: Run a chkdsk to verify the filesystem is in working order. My scan came back with required repairs which it corrected. Subsequent scans come back clean.

Command: chkdsk d: /f
Step 3: Run sfc to verify that Windows is ok. This returns that everything is ok.

Command: sfc /offbootdir=d:\ /offwindir=d:\Windows /scannow
Step 4: Just for grins I also ran DISM (Deployment Image Servicing and Management) to check the image integrity. It will throw a warning if you don't give it a scratch directory, so I just created a temporary one on my drive. This also found no corruption.

Command: dism /image:d:\ /cleanup-image /scanhealth /scratchdir:d:\temp
So far, so good... except it still won't boot up. I have an existing "twin" of this machine that should match it in most regards so just to be super certain I also run a manual hashing check on netio.sys and sacdrv.sys (more on that file later). The syntax for that is:

certutil.exe -hashfile drivers\netio.sys md5 (or sha1)
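
If both disks can be reached from a machine with Python (for example by mounting the two VMDKs), the same twin comparison can be scripted instead of eyeballing certutil output. This is only a sketch using the standard hashlib module; the function names and the directory paths in the usage note are made up.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha1", chunk_size=1024 * 1024):
    """Hash a file in chunks so large files don't need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_files(dir_a, dir_b, names, algorithm="sha1"):
    """Return {file name: True/False} for whether each file matches in both dirs."""
    return {
        name: file_digest(Path(dir_a) / name, algorithm)
        == file_digest(Path(dir_b) / name, algorithm)
        for name in names
    }
```

A call such as compare_files(r"d:\Windows\System32\drivers", r"e:\Windows\System32\drivers", ["netio.sys", "sacdrv.sys"]) would then report which of the two files differ, assuming the twin's disk is mounted as e:.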

The number 1 cause of netio.sys BSODs is driver conflicts according to googling, so I start down that road next. An export of all the drivers between the 2 systems shows that they are absolutely identical. Because that doesn't help me I start yanking out drivers to see if it will make a difference.

To get a list of non-Microsoft drivers I again use DISM and find that there are fortunately only 8 to worry about.

Command: dism /image:d:\ /scratchdir:d:\temp /get-drivers

I'm going to start removing drivers to see if that makes any difference. Again, using DISM I start by removing the vmxnet3 driver since it makes the most sense considering a netio.sys error.

Command: dism /image:d:\ /scratchdir:d:\temp /remove-driver /driver:oem4.inf
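
DISM accepts more than one /driver: argument after /remove-driver, so several third-party drivers can be pulled in a single pass. Here is a small Python sketch that builds such a command; the function name and the oem*.inf names in the example are placeholders, and the real names should come from the /get-drivers output above.

```python
def dism_remove_drivers(inf_names, image="d:\\", scratch_dir="d:\\temp"):
    """Build one DISM argument list that removes several drivers from an offline image."""
    if not inf_names:
        raise ValueError("at least one published .inf name is required")
    args = ["dism", f"/image:{image}", f"/scratchdir:{scratch_dir}", "/remove-driver"]
    # One /driver: argument per published name (oemN.inf).
    args += [f"/driver:{name}" for name in inf_names]
    return args


if __name__ == "__main__":
    # Placeholder names, only to show the shape of the resulting command.
    print(" ".join(dism_remove_drivers(["oem4.inf", "oem5.inf"])))
```

Removing drivers one at a time is slower but makes it easier to tell which one mattered, so this is mostly useful once you have given up on isolating a single culprit.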

After a reboot, no change. In 1 of my tests I then also removed the 7 remaining drivers, and that did nothing either. Time to get more information... Cue the next post...

Friday, July 1, 2016

Log Insight Configuration API Audit and Standalone Remediation Tool - Updated!

For those of you who are interested I have updated the API based audit and remediation tool with a couple new features. After all, what is the use of automation if it isn't user friendly?

1. Better error handling of remediation errors: In the past you would just get a message to the effect of "Something went wrong" but now the tool will pass the HTTP status code and Error Details from the Log Insight Server's response to your remediation request. In the below example you can see this in action.

2. Now includes a wizard to help build a simplified JSON configuration file! Now, without having to create a single bit of JSON, you can quickly get value from the tool. The wizard is simplified because, let's be honest, if you want the wizard you don't want to answer 250 questions. Because of this some things are assumed/disabled. If you want them you can simply add them to the configuration file yourself or use the template in the included docs (use the -d switch).

I hope that this helps you get started in seeing the value of using Configuration APIs to manage your Log Insight Servers!