
You might also want to check a previous article where I showed how to remove diacritics (accents) from strings: http://www.lazywinadmin.com/2015/05/powershell-remove-diacritics-accents.html
If you are familiar with regex, a simple approach is to use the metacharacter \w or a range such as [a-z]. That works well for English-only text, but a range like [a-z] drops accented letters, which are common in many other Latin-script languages.
Figure: Preview of the final solution
Regex approaches
Here are a couple of examples using different metacharacters and Unicode techniques. I stored the string in a variable, $String, to make the examples easier to read.
$String = "François-Xavier!?!#@$%^&*()_+\|}{○<>??/ €$¥£¢ \^$.|?*+()[{ 0123456789"
\W Meta-character
# Regular Expression - Using the \W (opposite of \w)
$String -replace '[\W]', ''
The \w metacharacter matches a word character. In .NET regular expressions, which PowerShell uses, that means any Unicode letter, any decimal digit, and connector punctuation such as the _ (underscore) character. Here we use \W, which matches everything that is not a word character, so replacing it with an empty string removes those characters. This works pretty well, but the underscore character '_' is kept. The accented "ç" is preserved because it is a letter.
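To make the behaviour easier to check by eye, here is a small sketch that uses a shorter sample string (an assumption for illustration, not the $String from this post). It shows that \W keeps both the underscore and the "ç", and that adding _ to the class is one possible way to drop the underscore as well.

```powershell
# Shorter, hypothetical sample string so the result is easy to verify
$Sample = 'François-Xavier_2015!'

# \W removes every non-word character; "_" and "ç" both count as word characters
$Result = $Sample -replace '\W', ''
$Result          # FrançoisXavier_2015

# Adding "_" to the character class removes the underscore too
$NoUnderscore = $Sample -replace '[\W_]', ''
$NoUnderscore    # FrançoisXavier2015
```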
[^a-zA-Z0-9] Ranges
# Regular Expression - Using characters from a-z, A-Z, 0-9
$String -replace '[^a-zA-Z0-9]', ''
Note: A ^ character at the start of a character class negates it, so the pattern matches every character that is not listed.
This works well for plain ASCII text, but accented letters are removed entirely rather than converted: "François" becomes "Franois".
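One detail worth knowing with ranges is that the -replace operator is case-insensitive by default, while -creplace is case-sensitive. The sketch below uses a shorter sample string (an assumption for illustration) to show the difference.

```powershell
# Shorter, hypothetical sample string
$Sample = 'François-Xavier_2015!'

# -replace ignores case, so a lowercase-only range still keeps "F" and "X"
$Insensitive = $Sample -replace '[^a-z0-9]', ''
$Insensitive   # FranoisXavier2015

# -creplace is case-sensitive: the uppercase letters are removed as well
$Sensitive = $Sample -creplace '[^a-z0-9]', ''
$Sensitive     # ranoisavier2015
```

In both cases the "ç" is dropped, because it falls outside the a-z range.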
ASCII Ranges
# Regular Expression - Using ASCII
# See http://www.asciitable.com/
$String -replace '[^\x30-\x39\x41-\x5A\x61-\x7A]+', ''
This is the same idea expressed with hexadecimal ASCII codes: \x30-\x39 is 0-9, \x41-\x5A is A-Z and \x61-\x7A is a-z. Accented letters are removed here too ("François" becomes "Franois").
UNICODE Specific Code Point
# Regular Expression - Unicode - Matching Specific Code Points
# See http://unicode-table.com/en/
$String -replace '[^\u0030-\u0039\u0041-\u005A\u0061-\u007A]+', ''
Same result again: these are the same three ranges, this time written as Unicode code points. Accented letters are removed ("François" becomes "Franois").
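Since the literal ranges, the hexadecimal escapes and the Unicode code point escapes all describe the same characters, they should give the same output. A quick sketch to confirm it, using a shorter sample string (an assumption for illustration):

```powershell
# Shorter, hypothetical sample string
$Sample = 'François-Xavier_2015!'

$Patterns = @(
    '[^a-zA-Z0-9]',                                 # literal ranges
    '[^\x30-\x39\x41-\x5A\x61-\x7A]',               # hexadecimal (ASCII) escapes
    '[^\u0030-\u0039\u0041-\u005A\u0061-\u007A]'    # Unicode code point escapes
)

# Apply each pattern to the same input and collect the three results
$Results = foreach ($Pattern in $Patterns) { $Sample -replace $Pattern, '' }

# Only one distinct value remains
$Results | Select-Object -Unique   # FranoisXavier2015
```

Which notation to use is mostly a matter of readability; the escapes are handy when the characters are hard to type.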
UNICODE Categories (This is what I use in my final function)
# Regular Expression - Unicode - Unicode Categories
$String -replace '[^\p{L}\p{Nd}]', ''
Each Unicode character belongs to a category. You can match a single character in the "letter" category with \p{L}, which includes accented letters such as "ç". The same goes for numbers: \p{Nd} matches decimal digits.
Other useful categories include \p{N} for any kind of number, \p{Nl} for a number that looks like a letter, such as a Roman numeral, and \p{No} for a superscript or subscript digit, or a number that is not a digit 0–9.
This is the method I use in the final function.
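To see how the number categories differ, here is a minimal sketch. The sample string is an assumption for illustration; it is built from explicit code points so that it does not depend on the encoding of the script file.

```powershell
# Hypothetical sample built from explicit code points
$RomanEight = [char]0x2167   # ROMAN NUMERAL EIGHT, category Nl
$SuperTwo   = [char]0x00B2   # SUPERSCRIPT TWO, category No
$Sample     = 'A1' + $RomanEight + $SuperTwo + '!'

$DecimalOnly = $Sample -replace '[^\p{Nd}]', ''   # keeps only the digit 1
$AnyNumber   = $Sample -replace '[^\p{N}]', ''    # keeps 1, the Roman numeral and the superscript
$LetterLike  = $Sample -replace '[^\p{Nl}]', ''   # keeps only the Roman numeral
$OtherNumber = $Sample -replace '[^\p{No}]', ''   # keeps only the superscript

$DecimalOnly        # 1
$AnyNumber.Length   # 3
```

\p{Nd} is the narrowest of these, which is why it pairs well with \p{L} when the goal is to keep only ordinary letters and digits.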
Keep some specific characters
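Now that I have the main code working, I want to add some exceptions. This is easily done by listing the characters to keep inside the character class. The characters ( } and _ are literals inside square brackets, so they do not need escaping. A forward slash is not an escape character in regex; putting / in the class would keep slashes as well. Example:
# Regular Expression - Unicode - Unicode Categories
# Exceptions: We want to keep the following characters: ( } _
$String -replace '[^\p{L}\p{Nd}(}_]', ''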
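Here is a small sketch of the exception list in action, using a shorter sample string (an assumption for illustration). Every character listed inside the brackets is treated as a literal to keep, so adding / to the class would preserve slashes too.

```powershell
# Shorter, hypothetical sample string
$Sample = 'François-Xavier (test)_{01}/!'

# Keep letters, decimal digits and the three literal characters ( } _
$Kept = $Sample -replace '[^\p{L}\p{Nd}(}_]', ''
$Kept             # FrançoisXavier(test_01}

# A hyphen is treated as a literal when it is the last character in the class
$KeptWithHyphen = $Sample -replace '[^\p{L}\p{Nd}(}_-]', ''
$KeptWithHyphen   # François-Xavier(test_01}
```

Characters that do have a special meaning inside a class, such as ], \ or a leading ^, need a backslash in front of them.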
Final Function
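As a rough idea of how the Unicode-category pattern and the list of characters to keep could be wrapped together, a function might look like the sketch below. The name Remove-SpecialCharacter and the -Keep parameter are placeholders chosen for illustration; the published function may be named and structured differently.

```powershell
function Remove-SpecialCharacter {
    # Hypothetical sketch: name and parameters are illustrative only
    [CmdletBinding()]
    param(
        [Parameter(ValueFromPipeline)]
        [string[]]$String,

        # Extra characters that should survive the clean-up
        [char[]]$Keep = @()
    )
    begin {
        # Write each kept character as a \uXXXX escape so that characters
        # such as ] or - cannot break the character class
        $Extra = ''
        foreach ($Char in $Keep) {
            $Extra += '\u{0:X4}' -f [int]$Char
        }
        $Pattern = "[^\p{L}\p{Nd}$Extra]"
    }
    process {
        foreach ($Item in $String) {
            $Item -replace $Pattern, ''
        }
    }
}

'François-Xavier (test)_{01}/!' | Remove-SpecialCharacter -Keep '(', '}', '_'
# FrançoisXavier(test_01}
```

Building the pattern once in the begin block avoids recomputing it for every string that arrives through the pipeline.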
The full function is available on my GitHub repository.
Thanks for reading! If you have any questions, leave a comment or send me an email at fxcat@lazywinadmin.com.