Emojis in Javascript. Parsing emoji in Javascript is… not… | by Kevin Scott | React Native Cafe | Medium

https://medium.com/reactnative/emojis-in-javascript-f693d0eb79fb · scraped

A digression into Unicode

The terms you want to digest are the following:

Code point: A numerical value assigned to a specific Unicode character.
Character Code: Another name for a code point.
Code Unit: The fixed-size unit an encoding uses to store a code point. Javascript uses UTF-16, whose code units are 16 bits wide.
Decimal: A way to represent code points in base 10.
Hexadecimal: A way to represent code points in base 16.

Let’s demonstrate with an example. Take as our specimen the letter A.

Sad sack A. Cheer up bud, we’re about to turn you into a code point!

The letter A is represented by the code point 65 (in decimal), or 41 (in hexadecimal).

codePointAt and fromCodePoint are new methods introduced in ES2015 that can handle Unicode characters whose UTF-16 encoding takes more than 16 bits, which includes emojis. Use these instead of charCodeAt, which doesn’t handle emoji correctly.

Here’s an example of using these methods, courtesy of xahlee.info:

    console.log("😸".charCodeAt(0)); // prints 55357 (only the first code unit), WRONG!
    console.log("😸".codePointAt(0)); // prints 128568, correct

I will be using hexadecimal representations (\u0041) from now on, because our future regex will be built that way. A few things to note about hexadecimal representation in Javascript:

All hexadecimal code points must be 4 characters. If the character code is less than 4 characters, it must be left padded with zeros.

    // This is invalid
    \u41
    // This is valid
    \u0041

All hexadecimal code points are case insensitive.

    // these are equivalent
    "\uD83D"
    "\ud83d"

They can be notated in two forms. In Javascript, a hexadecimal code point can be written in two ways: as a \u0041 escape inside a string, or as a 0x0041 number literal. Jump into your browser console and you’ll see the following are equivalent:

    String.fromCodePoint(0x0041);
    > 'A'
    '\u0041';
    > 'A'

Back to Emojis

Originally, the range of code points was 16 bits, which covered the characters of most modern scripts (that range is now known as the Basic Multilingual Plane, or BMP).
Now, in addition to that original range, there are 16 more planes (17 total) to choose from. The rest of the planes beyond the BMP are referred to as the “astral planes”, which include emoji. Most emoji live on Plane 1, the Supplementary Multilingual Plane.

And the Consortium said, let there be emoji

What do you think the following will produce?

    "😀".length

If you said 1, you are mistaken my friend! The correct answer is 2.

In Javascript, a string is a sequence of 16-bit code units. Since most emoji are encoded above the BMP, each one is represented by a pair of code units, also known as a surrogate pair.

So for instance, 0x1F600, which is 😀, is represented by:

    "\uD83D\uDE00"

(The first unit of the pair is called the lead surrogate, and the second the tail surrogate.)

Go ahead and copy that surrogate pair into your browser, and you’ll see 😀. Javascript counts this pair of code units as having a length of 2. That’s why you can’t just do something like:

    "abc😀".split('')
    > ["a", "b", "c", "�", "�"]

So, how do we get the surrogate pair? There’s a great explanation here, and a gist illustrating going from emoji to decimal to surrogate pair and back again.

Because of these limitations within Javascript, in order to parse strings containing emoji, we need some fancy footwork.

Writing a regular expression

Luckily, the internet is awash in smarter folks than I. The lodash library has produced a rock solid emoji regular expression. It is:

    (?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff])[\ufe0e\ufe0f]?(?:[\u0300-\u036f\ufe20-\ufe23\u20d0-\u20f0]|\ud83c[\udffb-\udfff])?(?:\u200d(?:[^\ud800-\udfff]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff])[\ufe0e\ufe0f]?(?:[\u0300-\u036f\ufe20-\ufe23\u20d0-\u20f0]|\ud83c[\udffb-\udfff])?)*

Woof, that’s a monster! Still, we’re enterprising programmers, we’re not afraid of a little regex, right?
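The gist mentioned above is not reproduced here, so what follows is only a minimal sketch of the round trip it describes, not the author's code. The name toUTF16 is borrowed from the examples later in the article, but its body here is an assumption, and fromSurrogates is a hypothetical name for the inverse.

```javascript
// Hypothetical helper: turn a code point into its UTF-16 code units,
// written out as \uXXXX escapes.
function toUTF16(codePoint) {
  const hex = (unit) =>
    "\\u" + unit.toString(16).toUpperCase().padStart(4, "0");
  if (codePoint <= 0xffff) {
    return hex(codePoint); // inside the BMP: a single code unit
  }
  const offset = codePoint - 0x10000; // 20 bits are left to encode
  const lead = 0xd800 + (offset >> 10); // top 10 bits
  const tail = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return hex(lead) + hex(tail);
}

// Hypothetical inverse: rebuild the code point from a surrogate pair.
function fromSurrogates(lead, tail) {
  return (lead - 0xd800) * 0x400 + (tail - 0xdc00) + 0x10000;
}

const codePoint = "😀".codePointAt(0); // 128512, i.e. 0x1F600
console.log(toUTF16(codePoint)); // \uD83D\uDE00
console.log(fromSurrogates(0xd83d, 0xde00) === codePoint); // true

// Array.from iterates by code point, so the pair stays together.
console.log(Array.from("abc😀").length); // 4
```

Array.from is one way around the split('') problem shown earlier, since string iteration keeps surrogate pairs together.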
Let’s reverse engineer this.

From the Wikipedia emoji entry, there are a couple of ranges of emoji (many of which have unassigned values, presumably for future emoji). To make this easier, I’m assuming anything in those ranges is emoji. Our audience uses the English alphabet over SMS, so tough luck if I trawl up any other unsuspecting characters.

Dingbats

They range from U+2700 to U+27BF, so the regular expression for that looks like:

    [\u2700-\u27bf]

    /[\u2700-\u27bf]/.test('✊')
    > true

Miscellaneous Symbols and Pictographs

These range from U+1F300 to U+1F5FF, with the following surrogate pairs:

    toUTF16(0x1F300)
    > "\uD83C\uDF00"
    toUTF16(0x1F5FF)
    > "\uD83D\uDDFF"

The regex for this range, from lodash’s implementation, is:

    [\ud800-\udbff][\udc00-\udfff]

This pattern matches any surrogate pair at all, which is broader than the block itself but fits the assumption above.

    /[\ud800-\udbff][\udc00-\udfff]/.test(String.fromCodePoint(0x1F5FF))
    > true
    /[\ud800-\udbff][\udc00-\udfff]/.test(String.fromCodePoint(0x1F300))
    > true

Supplemental Symbols and Pictographs

From U+1F900 to U+1F9FF, with the following surrogate pairs:

    toUTF16(0x1F910)
    > "\uD83E\uDD10"
    toUTF16(0x1F9C0)
    > "\uD83E\uDDC0"

We can reuse the same regex as above:

    /[\ud800-\udbff][\udc00-\udfff]/.test(String.fromCodePoint(0x1F910))
    > true
    /[\ud800-\udbff][\udc00-\udfff]/.test(String.fromCodePoint(0x1F9C0))
    > true

Emoticons

From U+1F600 to U+1F64F, with surrogate pairs:

    toUTF16(0x1F600)
    > "\uD83D\uDE00"
    toUTF16(0x1F64F)
    > "\uD83D\uDE4F"

Also covered by that same regex:

    /[\ud800-\udbff][\udc00-\udfff]/.test(String.fromCodePoint(0x1F600))
    > true
    /[\ud800-\udbff][\udc00-\udfff]/.test(String.fromCodePoint(0x1F64F))
    > true

Transport and Map Symbols

Includes U+1F680 to U+1F6FF, with surrogate pairs:

    toUTF16(0x1F680)
    > "\uD83D\uDE80"
    toUTF16(0x1F6FF)
    > "\uD83D\uDEFF"

Also covered by that same regex:

    /[\ud800-\udbff][\udc00-\udfff]/.test(String.fromCodePoint(0x1F680))
    > true
    /[\ud800-\udbff][\udc00-\udfff]/.test(String.fromCodePoint(0x1F6FF))
    > true

Miscellaneous Symbols

Includes U+2600 to U+26FF. These sit inside the BMP, so each one is a single code unit rather than a surrogate pair:

    toUTF16(0x2600)
    > "\u2600"
    toUTF16(0x26FF)
    > "\u26FF"

We can write a regex for this like so:

    /[\u2600-\u26FF]/

    /[\u2600-\u26FF]/.test(String.fromCodePoint(0x2600))
    > true
    /[\u2600-\u26FF]/.test(String.fromCodePoint(0x26FF))
    > true

lodash’s mysterious other regex

There’s another section in the beginning of that original lodash regex we haven’t looked at yet:

    (?:\ud83c[\udde6-\uddff]){2}

If we examine what those characters represent, we get:

    "\ud83c\udde6"
    > "🇦"
    "\ud83c\uddff"
    > "🇿"

Holy camoley, what the heck are those? I’ll tell you what those are: those are the regional indicator symbol letters A-Z. These are used to create flags for various countries. For instance:

    "\ud83c\uddfa"
    > "🇺"
    "\ud83c\uddf8"
    > "🇸"
    // when combining "u" + "s":
    "\ud83c\uddfa" + "\ud83c\uddf8"
    > "🇺🇸"

So that’s a good section to keep around. The regex so far is:

    (?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff])

Let’s test it out

I’m relying on Emoji-data’s json to provide a library of every emoji. When we run this regular expression against that list, we get 746 matches, 99 misses.
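A tally like the one above could be produced with a small harness along these lines. This is only a sketch: the samples array is a hypothetical five-item stand-in for Emoji-data's full list, and tally is an illustrative name, so the counts here will not match the article's 746 and 99.

```javascript
// The pattern assembled so far in the article.
const emojiSoFar =
  /(?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff])/;

// Hypothetical sample; the article runs against Emoji-data's JSON instead.
const samples = [
  "✊", // Dingbats block
  "😀", // astral plane, so a surrogate pair
  "🇺🇸", // two regional indicators
  "\u0039\uFE0F\u20E3", // keycap 9
  "\u00a9", // copyright sign
];

// Count how many entries the regex recognises and collect the rest.
function tally(regex, list) {
  const misses = list.filter((entry) => !regex.test(entry));
  return { matches: list.length - misses.length, misses };
}

const result = tally(emojiSoFar, samples);
console.log(result.matches); // 3
console.log(result.misses.length); // 2 (the keycap and the copyright sign)
```

The regex is deliberately left without the g flag, since test on a global regex carries lastIndex state between calls.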
Let’s go through the misses:

Keycaps

There are 12 keycap emojis (#️⃣, *️⃣ and 0️⃣–9️⃣), which look like:

    "\u0030\uFE0F\u20E3"
    > "0️⃣"
    "\u0039\uFE0F\u20E3"
    > "9️⃣"
    "\u0023\uFE0F\u20E3"
    > "#️⃣"
    "\u002A\uFE0F\u20E3"
    > "*️⃣"

(That middle "\uFE0F" is optional, by the way.)

These are covered by the following:

    /[\u0023-\u0039]\ufe0f?\u20e3/

    /[\u0023-\u0039]\ufe0f?\u20e3/.test("\u0023\uFE0F\u20E3")
    > true
    /[\u0023-\u0039]\ufe0f?\u20e3/.test("\u0039\u20E3")
    > true

Other Miscellaneous Emoji

Towards the bottom of the Unicode Block Emoji entry on Wikipedia is the following:

    Additional emoji can be found in the following Unicode blocks: Arrows (8 codepoints considered emoji), Basic Latin (12), CJK Symbols and Punctuation (2), Enclosed Alphanumeric Supplement (41), Enclosed Alphanumerics (1), Enclosed CJK Letters and Months (2), Enclosed Ideographic Supplement (15), General Punctuation (2), Geometric Shapes (8), Latin-1 Supplement (2), Letterlike Symbols (2), Mahjong Tiles (1), Miscellaneous Symbols and Arrows (7), Miscellaneous Technical (18), Playing Cards (1), and Supplemental Arrows-B (2).

Why the heck are these other random emoji scattered around like detritus? I believe the reason is: “because of history”. But I don’t really know. If you know, leave a comment and educate us all!

I won’t go through these one by one. You can look in my Github repo for a breakdown of the regex for each block.
Suffice to say the regex that covers all these pesky buggers is:

    [\u0023-\u0039]\ufe0f?\u20e3|\u3299|\u3297|\u303d|\u3030|\u24c2|\ud83c[\udd70-\udd71]|\ud83c[\udd7e-\udd7f]|\ud83c\udd8e|\ud83c[\udd91-\udd9a]|\ud83c[\udde6-\uddff]|\ud83c[\ude01-\ude02]|\ud83c\ude1a|\ud83c\ude2f|\ud83c[\ude32-\ude3a]|\ud83c[\ude50-\ude51]|\u203c|\u2049|[\u25aa-\u25ab]|\u25b6|\u25c0|[\u25fb-\u25fe]|\u00a9|\u00ae|\u2122|\u2139|\ud83c\udc04|[\u2600-\u26FF]|\u2b05|\u2b06|\u2b07|\u2b1b|\u2b1c|\u2b50|\u2b55|\u231a|\u231b|\u2328|\u23cf|[\u23e9-\u23f3]|[\u23f8-\u23fa]|\ud83c\udccf|\u2934|\u2935|[\u2190-\u21ff]

Conclusion

Which means that… drum roll… the final regex for parsing emojis is:

    (?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff]|[\u0023-\u0039]\ufe0f?\u20e3|\u3299|\u3297|\u303d|\u3030|\u24c2|\ud83c[\udd70-\udd71]|\ud83c[\udd7e-\udd7f]|\ud83c\udd8e|\ud83c[\udd91-\udd9a]|\ud83c[\udde6-\uddff]|\ud83c[\ude01-\ude02]|\ud83c\ude1a|\ud83c\ude2f|\ud83c[\ude32-\ude3a]|\ud83c[\ude50-\ude51]|\u203c|\u2049|[\u25aa-\u25ab]|\u25b6|\u25c0|[\u25fb-\u25fe]|\u00a9|\u00ae|\u2122|\u2139|\ud83c\udc04|[\u2600-\u26FF]|\u2b05|\u2b06|\u2b07|\u2b1b|\u2b1c|\u2b50|\u2b55|\u231a|\u231b|\u2328|\u23cf|[\u23e9-\u23f3]|[\u23f8-\u23fa]|\ud83c\udccf|\u2934|\u2935|[\u2190-\u21ff])

Hopefully that dispels some of the confusion around parsing emoji.
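To put a regex like this to work, one option is to add the g flag and pull every match out of a string. The sketch below is illustrative only: to stay short it assembles just a handful of the alternatives from the final regex (the full pattern would slot in the same way), and extractEmoji is a hypothetical helper name.

```javascript
// A shortened subset of the alternatives from the final regex.
const parts = [
  /[\u2700-\u27bf]/, // Dingbats
  /(?:\ud83c[\udde6-\uddff]){2}/, // flags (two regional indicators)
  /[\ud800-\udbff][\udc00-\udfff]/, // any surrogate pair
  /[\u0023-\u0039]\ufe0f?\u20e3/, // keycaps
  /[\u2600-\u26ff]/, // Miscellaneous Symbols
];

// Join the sources with | and add the g flag so match() returns every hit.
const emojiRegex = new RegExp(parts.map((p) => p.source).join("|"), "g");

// Hypothetical helper: list the emoji found in a piece of text.
function extractEmoji(text) {
  return text.match(emojiRegex) || [];
}

const found = extractEmoji("on my way 🚀 see you at 9\u20e3 ☀");
console.log(found.length); // 3: the rocket, the keycap 9 and the sun

// The same regex can strip emoji instead of collecting them.
console.log("abc😀".replace(emojiRegex, "")); // abc
```

Because the alternatives are tried left to right, the flag pattern has to come before the generic surrogate pair pattern, just as it does in the article's regex, or a flag would be split into two separate matches.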
