Vim: Cleaning Data-Files 🚧
In this post, we’ll take at look at using Vim to clean textual data.
Unicode Emojis
Download the dataset using:
iwr https://www.unicode.org/Public/UNIDATA/emoji/emoji-data.txt -OutFile emoji.csv
On inspection, the entries look like this:
0023 ; Emoji # E0.0 [1] (#️) number sign
002A ; Emoji # E0.0 [1] (*️) asterisk
0030..0039 ; Emoji # E0.0 [10] (0️..9️) digit zero..digit nine
00A9 ; Emoji # E0.6 [1] (©️) copyright
00AE ; Emoji # E0.6 [1] (®️) registered
203C ; Emoji # E0.6 [1] (‼️) double exclamation mark
2049 ; Emoji # E0.6 [1] (⁉️) exclamation question mark
2122 ; Emoji # E0.6 [1] (™️) trade mark
2139 ; Emoji # e0.6 [1] (ℹ️) information
...
We want them to look like this:
Hex-Value;Type;Sign;Name
0023; Emoji;#️;number sign
002A; Emoji;*️;asterisk
0030..0039; Emoji;0️..9️;digit zero..digit nine
00A9; Emoji;©️;copyright
00AE; Emoji;®️;registered
203C; Emoji;‼️;double exclamation mark
2049; Emoji;⁉️;exclamation question mark
2122; Emoji;™️;trade mark
2139; Emoji;ℹ️;information
...
We can do so by following these steps:
- Delete from end of
Emoji
to(
. - Starting from
)
delete all spaces. Yields last column. - From beginning-of-line from the first space, delete until start of
Emoji
; then insert;
to yield first/second columns. - Delete all comment lines (lines starting with
#
) & blank-lines.
:%normal ^fildf(i;
:%normal ^f)dwi;
:%normal ^f dtEi;
:g/^#/d | g/^$/d
We can now work with these entries using:
Import-Csv -Delimiter ';' emoji.csv
We can also create a JSON API using:
Import-Csv -Delimiter ';' ~\emoji.csv | ConvertTo-Json | Out-File -Encoding utf8 emoji.json