Handling Unicode in PowerShell
This post is inspired by an odd situation I ran into on a project I'm working on. I need to pull specific revisions of files out of a git repository, save those files, and then execute their contents. This all worked fine until it didn't. I received some complaints that Unicode characters in the files were getting mangled, and sure enough they were. But why? In this post I'll explain what happened to me and how you can avoid it yourself.
In the examples below we are going to be working with a file called "PoShUnicodeSample.txt" that contains the following:
Here is some text with a Unicode character embedded: ⁋
NOTE: The issue discussed in this post seems to be specific to Windows. Linux does not show the same behavior, but everything covered here will work on any OS.
Command Specific Encodings
Many commands in PowerShell will take -Encoding as a parameter. For example, if you want to read a file into a variable, and that file has Unicode characters, the following will result in mangled data:
$data = Get-Content -Path "PoShUnicodeSample.txt"
$data | Out-File -FilePath "Temp.txt"
If we open "Temp.txt" we'll see the following:
Here is some text with a Unicode character embedded: â‹
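To see where those stray characters come from, here is a minimal sketch that reproduces the mangling in memory. It assumes code page 1252 is available in your session (it is the encoding shown for my console later in this post); the variable names are only illustrative. The character is built from its code point so the encoding of the script file itself doesn't get in the way.

```powershell
# U+204B, built from its code point rather than typed as a literal.
$original = [string][char]0x204B

# Encode as UTF-8: these are the bytes that actually sit in the file.
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($original)
$hex = ($utf8Bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
"UTF-8 bytes: $hex"

# Decode the same bytes with a single-byte encoding (the wrong one).
# Every byte becomes its own character, so one symbol turns into three.
$cp1252 = [System.Text.Encoding]::GetEncoding(1252)
$mangled = $cp1252.GetString($utf8Bytes)
"Characters after decoding as Windows-1252: $($mangled.Length)"

# Decoding with the matching encoding round-trips cleanly.
$roundTrip = [System.Text.Encoding]::UTF8.GetString($utf8Bytes)
"Round trip OK: $($roundTrip -eq $original)"
```

The middle byte has no printable character assigned in Windows-1252, which is why only two of the three characters are visible in the mangled output.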
Luckily we can fix this with -Encoding UTF8:
$data = Get-Content -Path "PoShUnicodeSample.txt" -Encoding UTF8
$data | Out-File -FilePath "Temp.txt"
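Tada! We now have a proper Unicode-encoded output file, right? Almost. If you open the file in a text editor like VSCode, it reports the file as being encoded in UTF16LE. That is because Out-File in Windows PowerShell 5.1 defaults to UTF-16LE (which it calls Unicode). The Out-File documentation for PowerShell 6 and later lists UTF8NoBOM as the default, but that default does not apply to the older version. If we want straight UTF-8 we have to tell it to use that encoding via the -Encoding parameter.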
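As a rough sketch of setting the encoding on both the read and the write, the snippet below creates a copy of the sample file in a temporary folder (the folder name PoShUnicodeDemo is made up for this example), copies it through Get-Content and Out-File, and then inspects the result. Note that -Encoding UTF8 writes a byte order mark on Windows PowerShell 5.1 and no byte order mark on PowerShell 6 and later, so the BOM line may print either value.

```powershell
# Work in a scratch folder so nothing in the current directory is touched.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) "PoShUnicodeDemo"
New-Item -ItemType Directory -Path $dir -Force | Out-Null
$source = Join-Path $dir "PoShUnicodeSample.txt"
$target = Join-Path $dir "Temp.txt"

# Create the sample file as UTF-8 without a BOM.
$text = "Here is some text with a Unicode character embedded: " + [char]0x204B
[System.IO.File]::WriteAllText($source, $text, [System.Text.UTF8Encoding]::new($false))

# State the encoding explicitly on the way in and on the way out.
$data = Get-Content -Path $source -Encoding UTF8
$data | Out-File -FilePath $target -Encoding UTF8

# Check the first bytes to see whether a UTF-8 byte order mark was written.
$bytes = [System.IO.File]::ReadAllBytes($target)
$hasBom = $bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF
"UTF-8 BOM present: $hasBom"

# Read the copy back and compare it with what we started from.
$readBack = Get-Content -Path $target -Encoding UTF8
"Content preserved: $($readBack -eq $text)"
```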
So, if you are working with unicode, and the encoding is important, make sure you are always setting the encoding explicitly. When I was troubleshooting this issue, I thought this solved my issue, but when I put the changes into the project I was working on, I was still seeing the issue. It took a little help from the folks in #PowershellHelp on the SQL Community Slack to get the issue solved.
PowerShell has a set of default encodings it uses for all input and output operations. You can check what your current settings are by looking at the OutputEncoding and InputEncoding properties of the console:
PS> [Console]::OutputEncoding
IsSingleByte      : True
BodyName          : iso-8859-1
EncodingName      : Western European (Windows)
HeaderName        : Windows-1252
WebName           : Windows-1252
WindowsCodePage   : 1252
IsBrowserDisplay  : True
IsBrowserSave     : True
IsMailNewsDisplay : True
IsMailNewsSave    : True
EncoderFallback   : System.Text.InternalEncoderBestFitFallback
DecoderFallback   : System.Text.InternalDecoderBestFitFallback
IsReadOnly        : True
CodePage          : 1252
PS> [Console]::InputEncoding
IsSingleByte      : True
BodyName          : iso-8859-1
EncodingName      : Western European (Windows)
HeaderName        : Windows-1252
WebName           : Windows-1252
WindowsCodePage   : 1252
IsBrowserDisplay  : True
IsBrowserSave     : True
IsMailNewsDisplay : True
IsMailNewsSave    : True
EncoderFallback   : System.Text.InternalEncoderBestFitFallback
DecoderFallback   : System.Text.InternalDecoderBestFitFallback
IsReadOnly        : True
CodePage          : 1252
As you can see, on my system the default encoding is Windows-1252 (code page 1252), which .NET reports with a BodyName of iso-8859-1. Yours may be different, and if you are using a Linux system it almost certainly will be (it will likely be UTF-8 in that case).
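If the full property dump is more than you need, a shorter check along these lines may be enough. The property names on the summary object are my own labels. The $OutputEncoding preference variable is included because PowerShell uses it when piping text to an external program, which is a separate setting from the console encodings.

```powershell
# Compact view of the encodings that affect console and native-command I/O.
$summary = [pscustomobject]@{
    ConsoleOutput  = [Console]::OutputEncoding.WebName
    ConsoleInput   = [Console]::InputEncoding.WebName
    OutputCodePage = [Console]::OutputEncoding.CodePage
    # Preference variable used when PowerShell pipes text TO an external program.
    PipeToNative   = $OutputEncoding.WebName
}
$summary | Format-List
```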
Solving my Problem
As I said at the top of this post, when I encountered this issue I was using git show to pull the content of a script file from a git repo and store it in a local file. The following syntax will accomplish that:
git show "origin/Branch:path/to/file.txt" | Out-File -FilePath "LocalFile.txt" -Encoding "utf8"
But I found that the Unicode characters were STILL being mangled. This is because the console's output encoding was not UTF-8. PowerShell uses that encoding to decode the output of non-PowerShell commands like git, so the UTF-8 bytes git produced were being read as Windows-1252 before Out-File ever saw them. To fix this, we have to change the default encoding of the console to UTF-8:
PS> [Console]::OutputEncoding = [System.Text.Encoding]::UTF8
Running the git show command after that results in the Unicode characters being preserved. Success!
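If you would rather not leave the console encoding changed for the rest of the session, one option is to switch it only around the external command and restore it afterwards. The sketch below wraps that in a helper; Invoke-WithUtf8Console is a name I made up for this example, not a built-in command.

```powershell
function Invoke-WithUtf8Console {
    param(
        [Parameter(Mandatory)]
        [scriptblock]$ScriptBlock
    )

    # Remember the current setting so it can be restored afterwards.
    $previous = [Console]::OutputEncoding
    try {
        # Output of external programs started in here is decoded as UTF-8.
        [Console]::OutputEncoding = [System.Text.Encoding]::UTF8
        & $ScriptBlock
    }
    finally {
        # Restore the original encoding even if the command fails.
        [Console]::OutputEncoding = $previous
    }
}

# Usage with the command from this post (requires git and a repository):
# Invoke-WithUtf8Console { git show "origin/Branch:path/to/file.txt" } |
#     Out-File -FilePath "LocalFile.txt" -Encoding "utf8"
```

The try/finally is there so a failing command can't leave the console stuck in a different encoding than the one the session started with.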
If you are always executing scripts under your own PowerShell console, and want to make sure you are always handling unicode data properly, you could add the following to your PowerShell profile:
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
[Console]::InputEncoding = [System.Text.Encoding]::UTF8
That, combined with the -Encoding parameter used when working with files, should cover most of your needs. If you are working in an environment where you don't have access to the profile, you'll just have to make sure to include the console encoding changes in your scripts.
Overall, PowerShell offers a lot of flexibility around handling different file encodings. Unfortunately, it's not obvious what encoding you'll end up with if you don't set it explicitly.