Handling Unicode in PowerShell
Estimated read time: 4-5 minutes
This post is inspired by an odd situation I ran into in a project I'm working on. I need to pull specific revisions of files out of a git repository, save those files, and then execute their contents. This all worked fine until it didn't. I received some complaints that Unicode characters in the files were getting mangled, and sure enough they were. But why? In this post I'll explain what happened to me and how you can avoid it yourself.
In the examples below we are going to be working with a file called "PoShUnicodeSample.txt" that contains the following:
Here is some text with a Unicode character embedded: ⁋
NOTE: The issue discussed in this post seems to be specific to Windows; Linux does not show the same behavior. Everything covered here will still work on any OS.
Command Specific Encodings
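Many commands in PowerShell take -Encoding as a parameter. For example, if you read a file that contains Unicode characters into a variable without specifying an encoding, the data can come back mangled. In Windows PowerShell 5.1, Get-Content assumes the system's ANSI code page when a file has no byte order mark (BOM), so a UTF-8 file without a BOM is misread by the following: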
$data = Get-Content -Path "PoShUnicodeSample.txt"
$data | Out-File -FilePath "Temp.txt"
If we open "Temp.txt" we'll see the following:
Here is some text with a Unicode character embedded: â‹
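The mangling can be reproduced without any files. The following is a minimal sketch, assuming code page 1252 is the encoding in play (as it is on my system): it encodes the sample character as UTF-8 and then decodes those same bytes as Windows-1252. The variable names are only for illustration.

```powershell
# U+204B is the reversed pilcrow used in the sample file.
# Built from its code point so this script's own encoding doesn't matter.
$char = [string][char]0x204B

# The three bytes a UTF-8 file stores for that character.
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($char)
$hex = ($utf8Bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
Write-Output "UTF-8 bytes: $hex"   # E2 81 8B

# Decode the same bytes with the wrong encoding (assumption: code page 1252).
$mangled = [System.Text.Encoding]::GetEncoding(1252).GetString($utf8Bytes)
Write-Output "Decoded as Windows-1252: $mangled"
```

Each of the three UTF-8 bytes is treated as its own character, which is why one symbol turns into several unrelated ones.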
Luckily we can fix this with -Encoding!
$data = Get-Content -Path "PoShUnicodeSample.txt" -Encoding UTF8
$data | Out-File -FilePath "Temp.txt"
Tada! We now have a proper Unicode encoded output file, right? Almost. If you open the file in a text editor like VSCode it reports the file as being encoded in UTF16LE. That is because Out-File in Windows PowerShell 5.1 defaults to Unicode (UTF-16LE); the UTF8NoBOM default listed in the current Out-File documentation only applies to PowerShell 6 and later. If we want UTF-8 we have to tell Out-File to use that encoding via -Encoding UTF8. Note that in Windows PowerShell 5.1 this writes UTF-8 with a BOM.
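Rather than relying on what an editor reports, you can look at the first bytes of a file yourself. This is a rough sketch, not a full encoding detector: Get-FileBom is a hypothetical helper name, and it only recognizes the three most common byte order marks.

```powershell
# Hypothetical helper: report which byte order mark (if any) starts a file.
function Get-FileBom {
    param([Parameter(Mandatory)][string]$Path)

    # Resolve to a full path because .NET does not follow PowerShell's current location.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $bytes = [System.IO.File]::ReadAllBytes($fullPath)

    if ($bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF) {
        return 'UTF-8 BOM'
    }
    # Caveat: a UTF-32LE file also starts with FF FE and would be reported here.
    if ($bytes.Length -ge 2 -and $bytes[0] -eq 0xFF -and $bytes[1] -eq 0xFE) {
        return 'UTF-16LE BOM'
    }
    if ($bytes.Length -ge 2 -and $bytes[0] -eq 0xFE -and $bytes[1] -eq 0xFF) {
        return 'UTF-16BE BOM'
    }
    return 'no BOM'
}
```

Running Get-FileBom -Path "Temp.txt" after each of the Out-File variations above is a quick way to see what was actually written. A result of 'no BOM' does not tell you the encoding by itself; it only rules out the marked ones.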
So, if you are working with Unicode and the encoding is important, make sure you always set the encoding explicitly. When I was troubleshooting, I thought this had solved my problem, but once I put the changes into the project I was working on, the characters were still being mangled. It took a little help from the folks in #PowershellHelp on the SQL Community Slack to get it solved.
Default Encodings
The console also has default encodings, which PowerShell uses when text passes between it and external programs. You can check your current settings by looking at the OutputEncoding and InputEncoding properties of the console:
PS> [Console]::OutputEncoding
IsSingleByte : True
BodyName : iso-8859-1
EncodingName : Western European (Windows)
HeaderName : Windows-1252
WebName : Windows-1252
WindowsCodePage : 1252
IsBrowserDisplay : True
IsBrowserSave : True
IsMailNewsDisplay : True
IsMailNewsSave : True
EncoderFallback : System.Text.InternalEncoderBestFitFallback
DecoderFallback : System.Text.InternalDecoderBestFitFallback
IsReadOnly : True
CodePage : 1252
PS> [Console]::InputEncoding
IsSingleByte : True
BodyName : iso-8859-1
EncodingName : Western European (Windows)
HeaderName : Windows-1252
WebName : Windows-1252
WindowsCodePage : 1252
IsBrowserDisplay : True
IsBrowserSave : True
IsMailNewsDisplay : True
IsMailNewsSave : True
EncoderFallback : System.Text.InternalEncoderBestFitFallback
DecoderFallback : System.Text.InternalDecoderBestFitFallback
IsReadOnly : True
CodePage : 1252
As you can see, on my system the default encoding is Windows-1252 (code page 1252, which reports a BodyName of iso-8859-1). Yours may be different, and if you are using a Linux system it almost certainly will be (it will likely be UTF-8 in that case).
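The full property dump is more than you usually need. As a small sketch, the two values that matter most can be pulled into one object; the property names on the left are my own labels, not anything PowerShell defines.

```powershell
# Summarize the console encodings using only WebName and CodePage.
$consoleEncodings = [pscustomobject]@{
    OutputWebName  = [Console]::OutputEncoding.WebName
    OutputCodePage = [Console]::OutputEncoding.CodePage
    InputWebName   = [Console]::InputEncoding.WebName
    InputCodePage  = [Console]::InputEncoding.CodePage
}

$consoleEncodings | Format-List
```

A CodePage of 65001 means UTF-8; anything else is worth keeping in mind when external programs emit non-ASCII text.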
Solving my Problem
As I said at the top of this post, when I encountered this issue I was using git show to pull the content of a script file from a git repo and store it in a local file. The following syntax will accomplish that:
git show "origin/Branch:path/to/file.txt" | Out-File -FilePath "LocalFile.txt" -Encoding "utf8"
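But I found that the Unicode characters were STILL being mangled. This is because the console's output encoding was not UTF-8. PowerShell decodes the output of any external program run in that console, such as git, using the console's output encoding, which on my system was Windows-1252 (the iso-8859-1 BodyName shown above). The UTF-8 bytes from git were therefore misread before Out-File ever saw them, and -Encoding could not repair them afterwards. To fix this, we have to change the default encoding of the console to UTF-8: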
PS> [Console]::OutputEncoding = [System.Text.Encoding]::UTF8
Re-running my git show command after that results in the Unicode characters being preserved. Success!
If you are always executing scripts under your own PowerShell console, and want to make sure you are always handling unicode data properly, you could add the following to your PowerShell profile:
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
[Console]::InputEncoding = [System.Text.Encoding]::UTF8
That, combined with the -Encoding parameter used when working with files, should cover most of your needs. If you are working in an environment where you don't have access to the profile, you'll have to include the console encoding changes in your scripts.
Conclusion
Overall, PowerShell offers a lot of flexibility around handling different file encodings. Unfortunately, it's not obvious which encoding you'll end up with if you don't set it explicitly.