2015/08/30

PowerShell - Remove special characters from a string using Regular Expression (Regex)

Some more string manipulations! Today I'd like to remove the special characters and only keep alphanumeric characters using Regular Expression (Regex).

You might be interested to check a previous article where I showed how to remove diacritics (accents) from some strings, see here: http://www.lazywinadmin.com/2015/05/powershell-remove-diacritics-accents.html


My goal is to keep only the characters considered letters or numbers.
If you are familiar with Regex, you could do something as simple as using the metacharacter \w or a range like [a-z]. A range like [a-z] is great when you only work with English, but it does not cover letters with accents or diacritics, found in many Latin-script languages for example. In .NET, \w does cover them, but it also lets the underscore through.

Preview of the final solution


Regex approaches

Here are a few examples using different metacharacters and Unicode techniques.
I stored the string in a variable $String to make it easy to read.

$String = "François-Xavier!?!#@$%^&*()_+\|}{○<>??/ €$¥£¢ \^$.|?*+()[{ 0123456789"



\W Meta-character

# Regular Expression - Using the \W (opposite of \w)
$String -replace '[\W]', ''
The \w metacharacter matches a word character. In many regex flavors that means a character from a-z, A-Z, 0-9, plus the _ (underscore) character; in .NET, which PowerShell uses, it also matches Unicode letters and decimal digits. Here we use \W, which removes everything that is not a word character. This works pretty well, but we get an extra underscore character '_'. The cedilla on the ç is preserved.
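As a quick check of the behavior described above, here is a minimal sketch. The variable names are mine, and the sample string is redefined in single quotes so every character stays literal. The second call uses the .NET RegexOptions.ECMAScript option, which restricts \w to [a-zA-Z0-9_], to show the difference.

```powershell
# Same sample string as in the article, single-quoted so nothing is expanded.
$String = 'François-Xavier!?!#@$%^&*()_+\|}{○<>??/ €$¥£¢ \^$.|?*+()[{ 0123456789'

# Default .NET behavior: \w covers Unicode letters, so the ç survives (and so does _).
$UnicodeAware = $String -replace '\W', ''
$UnicodeAware   # FrançoisXavier_0123456789

# With the ECMAScript option, \w is limited to [a-zA-Z0-9_], so the ç is removed.
$Options   = [System.Text.RegularExpressions.RegexOptions]::ECMAScript
$AsciiOnly = [regex]::Replace($String, '\W', '', $Options)
$AsciiOnly      # FranoisXavier_0123456789
```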



[^a-zA-Z0-9] Ranges

# Regular Expression - Using characters from a-z, A-Z, 0-9
$String -replace '[^a-zA-Z0-9]', ''

Note: The ^ character at the start of a character class negates it, so the pattern matches every character that is NOT in the listed ranges.

This works well, but characters with diacritics are removed entirely (the ç of "François" is missing).
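One detail worth knowing with ranges: the -replace operator ignores case, while -creplace is case-sensitive. The sketch below (variable names are mine) shows that a lowercase-only range is enough with -replace, and what happens when case matters.

```powershell
# Same sample string as in the article, single-quoted so nothing is expanded.
$String = 'François-Xavier!?!#@$%^&*()_+\|}{○<>??/ €$¥£¢ \^$.|?*+()[{ 0123456789'

# -replace is case-insensitive: a-z also protects A-Z.
$Insensitive = $String -replace '[^a-z0-9]', ''
$Insensitive   # FranoisXavier0123456789

# -creplace is case-sensitive: the uppercase F and X are removed as well.
$Sensitive = $String -creplace '[^a-z0-9]', ''
$Sensitive     # ranoisavier0123456789
```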


ASCII Ranges

# Regular Expression - Using ASCII
#  See http://www.asciitable.com/
$String -replace '[^\x30-\x39\x41-\x5A\x61-\x7A]+', ''

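Same here: we are using specific ranges of ASCII characters, written as hexadecimal codes (\x30-\x39 is 0-9, \x41-\x5A is A-Z, \x61-\x7A is a-z). Characters with diacritics are removed (the ç of "François" is missing).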



UNICODE Specific Code Point

# Regular Expression - Unicode - Matching Specific Code Points
# See http://unicode-table.com/en/
$String -replace '[^\u0030-\u0039\u0041-\u005A\u0061-\u007A]+', ''

Same here again: we are using specific ranges of Unicode code points, which cover the same characters as the ASCII ranges. Characters with diacritics are removed (the ç of "François" is missing).
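The three range-based patterns describe the same set of characters, so they should give the same result. A small sketch to confirm it, where the $Patterns and $Results names are just for illustration:

```powershell
# Same sample string as in the article, single-quoted so nothing is expanded.
$String = 'François-Xavier!?!#@$%^&*()_+\|}{○<>??/ €$¥£¢ \^$.|?*+()[{ 0123456789'

# The three ways of writing "anything but a-z, A-Z, 0-9" shown so far.
$Patterns = [ordered]@{
    'Ranges'      = '[^a-zA-Z0-9]'
    'ASCII hex'   = '[^\x30-\x39\x41-\x5A\x61-\x7A]+'
    'Code points' = '[^\u0030-\u0039\u0041-\u005A\u0061-\u007A]+'
}

# Apply each pattern to the same string and collect the results.
$Results = foreach ($Name in $Patterns.Keys) {
    [pscustomobject]@{
        Technique = $Name
        Result    = ($String -replace $Patterns[$Name], '')
    }
}
$Results   # every row shows FranoisXavier0123456789
```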

UNICODE Categories (This is what I use in my final function)

# Regular Expression - Unicode - Unicode Categories
$String -replace '[^\p{L}\p{Nd}]', ''

Each Unicode character belongs to a certain category. You can match a single character belonging to the "letter" category with \p{L}. The same goes for numbers: \p{Nd} matches decimal digits.
Other useful examples: \p{N} for any kind of number, \p{Nl} for a number that looks like a letter, such as a Roman numeral, and \p{No} for a superscript or subscript digit, or a number that is not a digit 0–9.
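To make the number categories more concrete, here is a small sketch that tests a few sample characters against each of them. The samples are my own picks, built from their code points so the script does not depend on the file encoding. Note that \p{Nd} also matches decimal digits from other scripts, not only 0-9.

```powershell
# Sample characters (my own picks), created from their Unicode code points.
$Samples = [ordered]@{
    'Digit seven'          = [string][char]0x0037   # category Nd
    'Arabic-Indic three'   = [string][char]0x0663   # category Nd
    'Roman numeral eight'  = [string][char]0x2167   # category Nl
    'Superscript two'      = [string][char]0x00B2   # category No
    'Vulgar fraction half' = [string][char]0x00BD   # category No
}

# Test each sample against the general and the specific number categories.
$Results = foreach ($Name in $Samples.Keys) {
    $Char = $Samples[$Name]
    [pscustomobject]@{
        Name = $Name
        N    = ($Char -match '\p{N}')
        Nd   = ($Char -match '\p{Nd}')
        Nl   = ($Char -match '\p{Nl}')
        No   = ($Char -match '\p{No}')
    }
}
$Results
```

Every sample matches \p{N}, but only one of the three more specific categories.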


This is the method I use in the final function.


Keep some specific characters

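Now that I have the main code working, I want to add some exceptions.
This can easily be done by adding the characters to keep inside the brackets. Most characters, including ( } and _, are literal inside a character class; only \ ] ^ and - may need to be escaped with a backslash. Example:

# Regular Expression - Unicode - Unicode Categories
#  Exceptions: We want to keep the following characters: ( } _
$String -replace '[^\p{L}\p{Nd}(}_]', ''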
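A short sketch comparing ways of writing the exceptions (variable names are mine). Inside a character class, ( } and _ need no escaping, a backslash escape is harmless, and a forward slash is just another literal character, so it gets kept too.

```powershell
# Same sample string as in the article, single-quoted so nothing is expanded.
$String = 'François-Xavier!?!#@$%^&*()_+\|}{○<>??/ €$¥£¢ \^$.|?*+()[{ 0123456789'

# Characters listed as-is inside the class.
$Plain = $String -replace '[^\p{L}\p{Nd}(}_]', ''
$Plain      # FrançoisXavier(_}(0123456789

# Backslash-escaped version: same result.
$Escaped = $String -replace '[^\p{L}\p{Nd}\(\}_]', ''
$Escaped    # FrançoisXavier(_}(0123456789

# A forward slash is not an escape character: it is kept as a literal.
$WithSlash = $String -replace '[^\p{L}\p{Nd}/(/}/_]', ''
$WithSlash  # FrançoisXavier(_}/(0123456789
```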



Final Function

Available on my GitHub repository
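For readers who just want to see how the pieces fit together, here is a minimal sketch of such a function. It is NOT the code from the repository: the function name, the parameter names and the escaping helper are all hypothetical, and it only assumes the Unicode category pattern shown above.

```powershell
function Remove-SpecialCharacterSketch {
    # Hypothetical illustration only, not the function published on GitHub.
    [CmdletBinding()]
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [AllowEmptyString()]
        [string[]]$String,

        # Extra characters to keep, for example '(', '}', '_'
        [char[]]$Keep = @()
    )
    begin {
        # Backslash-escape the characters that are special inside [...]: \ ] ^ -
        $KeepClass = (-join $Keep) -replace '([\\\]\^\-])', '\$1'

        # Everything that is not a letter, a decimal digit or a kept character.
        $Pattern = '[^\p{L}\p{Nd}' + $KeepClass + ']'
    }
    process {
        foreach ($Item in $String) {
            $Item -replace $Pattern, ''
        }
    }
}

# Usage
Remove-SpecialCharacterSketch -String 'François-Xavier!?!#@ 0123'             # FrançoisXavier0123
Remove-SpecialCharacterSketch -String 'François-Xavier!?!#@ 0123' -Keep '-'   # François-Xavier0123
'a(b)_c}', 'Ünïcødé #1' | Remove-SpecialCharacterSketch -Keep '(', '}', '_'   # a(b_c}  and  Ünïcødé1
```

Building the pattern once in the begin block keeps the process block cheap when many strings come through the pipeline.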






Thanks for reading! If you have any questions, leave a comment or send me an email at fxcat@lazywinadmin.com. I invite you to follow me on Twitter @lazywinadm / Google+ / LinkedIn. You can also follow the LazyWinAdmin Blog on Facebook Page and Google+ Page.
