Exchange 2016 CU20 breaks Event logs

I don't post often, but when I have something that I need to get out there with regard to my subject matter, this is where I post it. In this instance it relates to Exchange 2016 CU20 and a little-known bug.

I found this quite by accident, many months after a successful installation of CU20. I encountered no issues at all as the update went on, but when I later needed to find something in my event logs, it transpired that one of them had stopped being populated from the point of the CU20 installation.

The log I am referring to is known as a crimson log: when you open Event Viewer and expand the Applications and Services Logs section, the logs under there are crimson logs. The one I refer to in this post is the MSExchange Management log.
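A quick way to spot that it has gone quiet is to look at the most recent entry in the log. Something like the following should do it (purely a sketch; the log name is exactly as it appears in Event Viewer):

Get-WinEvent -LogName 'MSExchange Management' -MaxEvents 1 | Format-List TimeCreated, Id, Message

If the TimeCreated of the newest entry roughly matches your CU20 installation date, you are probably in the same boat I was.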

This log was broken, and not just on one server: when an analysis was done, it had broken on 10% of my servers (4 out of 40), which meant that when MS Premier Support tried to replicate the issue, they were able to do so quite easily.

To cut to the chase: the log is not actually broken. Rather, entries are being diverted into the Application Log instead (where they are easily lost in the noise). The reason is that during CU20, setup may incorrectly write a registry key which takes precedence, because the registry is read from top to bottom and the new key appears before the crimson log entries. The key is:

HKLM:\SYSTEM\CurrentControlSet\Services\EventLog\Application\MSExchange CmdletLogs\

It should not be there. Because it is, entries from source MSExchange CmdletLogs are routed to the Application Log instead of the MSExchange Management crimson log, which basically goes dormant. Microsoft don't have this listed on their database of eff-ups, but that may be because (like me) nobody looks at this log until they actually need to.
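If you want to confirm the diversion for yourself, the entries turn up in the Application Log under the MSExchange CmdletLogs source. Something along these lines should surface the most recent ones (again just a sketch, using standard event log filtering):

Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName='MSExchange CmdletLogs'} -MaxEvents 10 | Format-Table TimeCreated, Id, Message -AutoSize

On an affected server you will see your Exchange cmdlet activity here rather than in MSExchange Management.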

The fix is to remove the invalid key and restart the server.

To identify the presence of the key on your Exchange servers, you can use the following from the Exchange Management Shell (EMS):

Get-ExchangeServer | %{$s = New-PSSession -ComputerName $_.Name; Invoke-Command -Session $s -ScriptBlock {if(Get-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Services\EventLog\Application\MSExchange CmdletLogs\' -ErrorAction SilentlyContinue){$env:computername}}; Remove-PSSession $s}

If it finds it, you'll get the server NetBIOS name listed.

The following PowerShell will remove the key on a local server:

Remove-Item 'HKLM:\SYSTEM\CurrentControlSet\Services\EventLog\Application\MSExchange CmdletLogs\' -Force -Recurse

That could be compiled into a remote PowerShell command (a sketch of what that might look like is below), but I like to mitigate the risk of trashing everything by taking such actions on the local server, which needs to be placed into Exchange maintenance mode and restarted anyway (the key being in CurrentControlSet means that it only takes effect when the registry hive is read, i.e. at startup).
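For completeness, a remote version in the same style as the detection one-liner above could look something like this - purely an illustration, and each affected server still needs the restart afterwards:

Get-ExchangeServer | %{$s = New-PSSession -ComputerName $_.Name; Invoke-Command -Session $s -ScriptBlock {if(Get-Item 'HKLM:\SYSTEM\CurrentControlSet\Services\EventLog\Application\MSExchange CmdletLogs\' -ErrorAction SilentlyContinue){Remove-Item 'HKLM:\SYSTEM\CurrentControlSet\Services\EventLog\Application\MSExchange CmdletLogs\' -Force -Recurse; $env:computername}}; Remove-PSSession $s}

It outputs the name of each server where the key was found and removed.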

After a restart, the original command should now produce an error (the key no longer exists), which should be logged to the Event log you just fixed:

Get-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Services\EventLog\Application\MSExchange CmdletLogs\' 

I don't know if this affects any other CUs on other Exchange versions as I am only currently working with Exchange 2016. I have implemented two further CUs since then and not experienced the same issue.

Y2K22

Tasked with recovering from the issue discovered in https://techcommunity.microsoft.com/t5/exchange-team-blog/email-stuck-in-exchange-on-premises-transport-queues/ba-p/3049447 (Email Stuck in Exchange On-premises Transport Queues) on a per-server basis, I came up with the following PowerShell.

It's all well and good to provide an automated method, but most of us can't blindly run such scripts and hope that everything goes okay - it didn't go okay in the first place, which is why there is an issue to fix at all.

Some background:

A break/fix was put in place in conjunction with MS Premier Support:

Get-MalwareFilteringServer | Set-MalwareFilteringServer -BypassFiltering $true

Get-ExchangeServer | where { ($_.IsHubTransportServer -eq "true")} | ForEach{ Invoke-Command -ComputerName $_.Name -ScriptBlock { Restart-Service msexchangetransport } }

I wasn't involved in that (I was busy seeing in the New Year), but I get the point. The Malware Filtering was bypassed as it is only one of many layers of protection in this particular environment and the least important one. It's possibly the same elsewhere for other on-premises Exchange customers.

My mission (and I chose to accept it, as that is what I get paid for) was to fix the underlying issue before re-enabling the Malware Filtering. I don't normally have a need to do anything with it, but I quickly observed that it is a bit 'laggy' when commands are run against it. So I wrote the following, which uses the manual rectification method in a set of commands that I can easily interact with and adjust if necessary. They are working well for me now, so I find I can run them all in one hit and just observe. The updating takes at least half an hour per server, and the timestamps are there because I use a start-transcript for all my PS command windows.
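For what it's worth, the transcript side of it is nothing more exotic than something like this at the top of the session (the path is just an example - use whatever suits your environment):

Start-Transcript -Path "C:\Scripts\Transcripts\EMS-$(Get-Date -Format yyyyMMdd-HHmm).log"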

Run from an elevated PowerShell prompt.

get-date -f HH:mm

$serverfqdn = ([System.Net.Dns]::GetHostByName($env:computerName)).HostName

. $env:ExchangeInstallPath\bin\RemoteExchange.ps1

Connect-ExchangeServer -auto -AllowClobber

$currentPrincipal = New-Object Security.Principal.WindowsPrincipal([Security.Principal.WindowsIdentity]::GetCurrent())

if($currentPrincipal.IsInRole([Security.Principal.WindowsBuiltInRole]::Administrator) -eq $FALSE){write-host -backgroundcolor yellow -foregroundcolor red "PLEASE RE-RUN AS ADMINISTRATOR"; EXIT}
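# Make sure the FIP-FS engine update process (updateservice.exe) is not left running before the engine folders are touched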

$p = Get-Process -Name updateservice -ErrorAction SilentlyContinue

if($p){Stop-Process $p}
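# Engine and metadata folders to be cleared out ($exinstall is populated when the EMS scripts above are loaded)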

$path1 = $exinstall + "FIP-FS\Data\Engines\amd64\Microsoft"

$path2 = $exinstall + "FIP-FS\Data\Engines\metadata"

Remove-Item $path1 -Recurse -Force

Remove-Item $path2 -Recurse -Force

New-Item $path2 -Type Directory | out-null
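# Stop transport, make sure the Filtering Management Service (FMS) is running, then bring transport back up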

get-service MSExchangeTransport | stop-service

get-Service FMS | start-service

get-service MSExchangeTransport | start-service

Add-PSSnapin Microsoft.Forefront.Filtering.Management.Powershell

cd $exscripts
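# Capture the current LastChecked value so the wait loop further down can tell when a fresh update check has started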

$origcheck = (Get-EngineUpdateInformation).LastChecked

Get-EngineUpdateInformation

.\Update-MalwareFilteringServer.ps1 $serverfqdn

get-date -f HH:mm
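# Wait for the update check to register (LastChecked changes), then for the update itself to complete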

while($origcheck -eq (Get-EngineUpdateInformation).LastChecked){sleep 15}

(Get-EngineUpdateInformation).UpdateStatus

get-date -f HH:mm

while((Get-EngineUpdateInformation).UpdateStatus -eq "UpdateInProgress"){sleep 60}; Get-EngineUpdateInformation

get-date -f HH:mm

There are a few things to note in this script.

1. A standard check that it is being run at an elevated prompt; if it isn't, it stops you in your tracks

2. $exinstall allows for a non-standard installation path, which is quite common

3. The while statements allow time for the malware engine update process to kick in. If it never does, you may have a proxy issue - refer to the Exchange blog article to help you out on that one

4. Updating isn't quick. It takes between 30 and 60 minutes for me. But so far (touch wood) it has been successful every time.

5. This is just fixing the stuff under the covers. I will still need to re-enable Malware Filtering by setting BypassFiltering back to $false later - the command for that is below
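Re-enabling it is simply the reverse of the original break/fix, to be run once you are happy that the engines are updating everywhere:

Get-MalwareFilteringServer | Set-MalwareFilteringServer -BypassFiltering $false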