Skip to the content.

Vim: Cleaning Data-Files 🚧

In this post, we’ll take at look at using Vim to clean textual data.

Unicode Emojis

Download the dataset using:

iwr https://www.unicode.org/Public/UNIDATA/emoji/emoji-data.txt -OutFile emoji.csv

On inspection, the entries look like this:

0023          ; Emoji                # E0.0   [1] (#️)       number sign
002A          ; Emoji                # E0.0   [1] (*️)       asterisk
0030..0039    ; Emoji                # E0.0  [10] (0️..9️)    digit zero..digit nine
00A9          ; Emoji                # E0.6   [1] (©️)       copyright
00AE          ; Emoji                # E0.6   [1] (®️)       registered
203C          ; Emoji                # E0.6   [1] (‼️)       double exclamation mark
2049          ; Emoji                # E0.6   [1] (⁉️)       exclamation question mark
2122          ; Emoji                # E0.6   [1] (™️)       trade mark
2139          ; Emoji                # e0.6   [1] (ℹ️)       information
...

We want them to look like this:

Hex-Value;Type;Sign;Name
0023; Emoji;#️;number sign
002A; Emoji;*️;asterisk
0030..0039; Emoji;0️..9️;digit zero..digit nine
00A9; Emoji;©️;copyright
00AE; Emoji;®️;registered
203C; Emoji;‼️;double exclamation mark
2049; Emoji;⁉️;exclamation question mark
2122; Emoji;™️;trade mark
2139; Emoji;ℹ️;information
...

We can do so by following these steps:

:%normal ^fildf(i;
:%normal ^f)dwi;
:%normal ^f dtEi;
:g/^#/d | g/^$/d

We can now work with these entries using:

Import-Csv -Delimiter ';' emoji.csv

We can also create a JSON API using:

Import-Csv -Delimiter ';' ~\emoji.csv | ConvertTo-Json | Out-File -Encoding utf8 emoji.json

References