Perl 6 By Example: A Unicode Search Tool

This blog post is part of my ongoing project
to write a book about Perl 6.

If you’re interested, please sign up for the mailing list at the bottom of
the article, or here. It will be
low volume (less than an email per month, on average).


Every so often I have to identify or research some Unicode
characters. There’s a tool called uni in the
Perl 5 distribution App::Uni.

Let’s reimplement its basic functionality in a few lines of Perl 6 code and
use that as an occasion to talk about Unicode support in Perl 6.

If you give it one character on the command line, it prints out a description
of the character:

$ uni 🕐
🕐 - U+1f550 - CLOCK FACE ONE OCLOCK

If you give it a longer string instead, it searches in the list of Unicode
character names and prints out the same information for each character whose
description matches the search string:

$ uni third|head -n5
⅓ - U+02153 - VULGAR FRACTION ONE THIRD
⅔ - U+02154 - VULGAR FRACTION TWO THIRDS
↉ - U+02189 - VULGAR FRACTION ZERO THIRDS
㆛ - U+0319b - IDEOGRAPHIC ANNOTATION THIRD MARK
𐄺 - U+1013a - AEGEAN WEIGHT THIRD SUBUNIT

Each line corresponds to what Unicode calls a “code point”, which is usually a
character on its own, but occasionally also something like U+00300 -
COMBINING GRAVE ACCENT, which, combined with a - U+00061 - LATIN SMALL
LETTER A, makes the character à.

Perl 6 offers a method uniname in both the classes Str and Int that
produces the Unicode code point name for a given character, either in its
direct character form, or in the form of its code point number. With that, we
can implement the first part of uni’s desired functionality:

#!/usr/bin/env perl6

use v6;

sub format-codepoint(Int $codepoint) {
    sprintf "%s - U+%05x - %s\n",
        $codepoint.chr,
        $codepoint,
        $codepoint.uniname;
}

multi sub MAIN(Str $x where .chars == 1) {
    print format-codepoint($x.ord);
}

Let’s look at it in action:

$ uni ø
ø - U+000f8 - LATIN SMALL LETTER O WITH STROKE

The chr method turns a code point number into the character and ord is
the reverse, in other words: from character to code point number.
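
As a small sketch of that round trip (the variable names here are made up for
illustration and are not part of the uni script):

```perl6
use v6;

# Round trip: character -> code point number -> character.
my $char      = 'ø';
my $codepoint = $char.ord;          # 248, which is 0xF8
my $back      = $codepoint.chr;     # 'ø' again

say $codepoint.fmt('U+%04X');       # U+00F8
say $back eq $char;                 # True

# uniname works on the character as well as on the number.
say $char.uniname;                  # LATIN SMALL LETTER O WITH STROKE
say 0xF8.uniname;                   # LATIN SMALL LETTER O WITH STROKE
```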

The second part, searching in all Unicode character names, works by
brute-force enumerating all possible characters and searching through their
uniname:

multi sub MAIN($search is copy) {
    $search.=uc;
    for 1..0x10FFFF -> $codepoint {
        if $codepoint.uniname.contains($search) {
            print format-codepoint($codepoint);
        }
    }
}

Since all character names are in upper case, the search term is first
converted to upper case with $search.=uc, which is short for
$search = $search.uc. By default, parameters are read-only, which is why the
declaration here uses is copy: it gives the routine its own modifiable copy
of the argument.
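
To see the effect of is copy in isolation, here is a minimal sketch. The sub
shout is a hypothetical helper that exists only for this illustration:

```perl6
use v6;

# Hypothetical helper, not part of the uni script.
sub shout($text is copy) {
    $text.=uc;          # allowed, because $text is a private, writable copy
    return $text;
}

my $word = 'third';
say shout($word);       # THIRD
say $word;              # third (the caller's variable is untouched)
```

Without is copy, the assignment inside the sub would fail, because $text
would be bound read-only to the caller's value.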

Instead of this rather imperative style, we can also formulate it in a more
functional style. We could think of it as a list of all characters, which we
whittle down to those characters that interest us, to finally format them
the way we want:

multi sub MAIN($search is copy) {
    $search.=uc;
    print (1..0x10FFFF).grep(*.uniname.contains($search))
                       .map(&format-codepoint)
                       .join;
}
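
The same pipeline can be tried out on a much smaller range, which is handy
for experimenting, since scanning all code points takes a while. This is only
a sketch; search-range is a name I made up for it, and the range used is the
Number Forms block:

```perl6
use v6;

# Hypothetical helper: the same grep as above, restricted to a given range.
sub search-range(Range $range, Str $search) {
    $range.grep(*.uniname.contains($search.uc));
}

# The Number Forms block, U+2150 to U+218F.
my @found = search-range(0x2150..0x218F, 'third');
say @found.map(*.fmt('U+%04X')).join(' ');    # U+2153 U+2154 U+2189
```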

To make it easier to identify (rather than search for) a string of more than
one character, an explicit option can help disambiguate:

multi sub MAIN($x, Bool :$identify!) {
    print $x.ords.map(&format-codepoint).join;
}

Str.ords returns the list of code points that make up the string. With this
multi candidate of sub MAIN in place, we can do something like

$ uni --identify øre
ø - U+000f8 - LATIN SMALL LETTER O WITH STROKE
r - U+00072 - LATIN SMALL LETTER R
e - U+00065 - LATIN SMALL LETTER E

Code Points, Grapheme Clusters and Bytes

As alluded to above, not all code points are fully-fledged characters on their
own. Or put another way, some things that we visually identify as a single
character are actually made up of several code points. Unicode calls such a
sequence of one base character and potentially several combining characters a
grapheme cluster.

Strings in Perl 6 are based on these grapheme clusters. If you get a list of
characters in a string with $str.comb, or extract a substring with
$str.substr(0, 4), match a regex against a string, determine the length, or
do any other operation on a string, the unit is always the grapheme cluster.
This best fits our intuitive understanding of what a character is and avoids
accidentally tearing apart a logical character through a substr, comb or
similar operation:

my $s = "ø\c[COMBINING TILDE]";
say $s;         # ø̃
say $s.chars;   # 1
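
The other string operations behave the same way. A short sketch, using a
string I picked for illustration that consists of two base letters with one
combining accent each:

```perl6
use v6;

# Two graphemes: e + acute accent, then t + acute accent.
my $s = "e\c[COMBINING ACUTE ACCENT]t\c[COMBINING ACUTE ACCENT]";

say $s.chars;           # 2
say $s.comb.elems;      # 2
say $s.substr(0, 1);    # é
say $s.flip.chars;      # 2 (accents stay attached to their base characters)
```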

The Uni type is akin to a string and
represents a sequence of codepoints. It is useful in edge cases, but doesn’t
support the same wealth of operations as
Str. The typical way to go from Str to a
Uni value is to use one of the NFC, NFD, NFKC, or NFKD methods, which
yield a Uni value in the normalization form of the same name.
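
As a sketch of the difference between two of these forms, take the à from
earlier: NFC prefers the precomposed code point, while NFD splits it into
base letter and combining accent. The hex formatting is just my choice for
display:

```perl6
use v6;

my $s = "a\c[COMBINING GRAVE ACCENT]";    # à

say $s.chars;                                       # 1
say $s.NFC.list.map(*.fmt('U+%04X')).join(' ');     # U+00E0
say $s.NFD.list.map(*.fmt('U+%04X')).join(' ');     # U+0061 U+0300
```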

Below the Uni level you can also represent strings as bytes by choosing an
encoding. If you want to get from string to the byte level, call the
encode method:

my $bytes = 'Perl 6'.encode('UTF-8');

UTF-8 is the default encoding and also the one Perl 6 assumes when reading
source files. The result is something that does the
Blob role; you can access
individual bytes with positional indexing, such as $bytes[0]. The
decode method helps
you to convert a Blob to a Str.
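
A minimal round trip through the byte level might look like this (the sample
string is arbitrary):

```perl6
use v6;

my $str   = 'søk';
my $bytes = $str.encode('UTF-8');

say $str.chars;             # 3 characters
say $bytes.elems;           # 4 bytes, since ø takes two bytes in UTF-8
say $bytes[1].base(16);     # C3, the first byte of ø

my $again = $bytes.decode('UTF-8');
say $again eq $str;         # True
```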

Numbers

Number literals in Perl 6 aren’t limited to the Arabic digits we are so used
to in the English-speaking part of the world. All Unicode code points that
have the Decimal_Number (short: Nd) property are allowed, so you can, for
example, use Bengali digits:

say ৪২;             # 42

The same holds true for string to number conversions:

say "৪২".Int;       # 42

For other numeric code points you can use the unival method to obtain its
numeric value:

say "\c[TIBETAN DIGIT HALF ZERO]".unival;

which produces the output -0.5 and also illustrates how to use a codepoint
by name inside a string literal.
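
A few more values, as a quick sketch with characters I picked for
illustration. For a character without a numeric value, unival returns NaN:

```perl6
use v6;

say "½".unival;                           # 0.5
say "\c[ROMAN NUMERAL TWELVE]".unival;    # 12
say "a".unival;                           # NaN
```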

Other Unicode Properties

The uniprop method
in type Str returns the general category by default:

say "ø".uniprop;                            # Ll
say "\c[TIBETAN DIGIT HALF ZERO]".uniprop;  # No

Making sense of the return value requires some Unicode knowledge; you can read
Unicode’s Technical Report 44 for the gory details.
Ll stands for Letter_Lowercase, No is Other_Number. This is what
Unicode calls the General Category, but you can ask the uniprop (or
uniprop-bool method if you’re only interested in a boolean result) for
other properties as well:

say "a".uniprop-bool('ASCII_Hex_Digit');    # True
say "ü".uniprop-bool('Numeric_Type');       # False
say ".".uniprop("Word_Break");              # MidNumLet

Collation

Sorting strings starts to become complicated when you’re not limited to ASCII
characters. Perl 6’s sort method uses the cmp infix operator, which does a
pretty standard lexicographic comparison based on the codepoint number.

If you need to use a more
sophisticated collation algorithm, Rakudo 2017.02 and newer offer the
Unicode Collation Algorithm as an
experimental feature:

my @list = <a ö ä Ä o ø>;
say @list.sort;                     # (a o Ä ä ö ø)

use experimental :collation;
say @list.collate;                  # (a ä Ä o ö ø)
$*COLLATION.set(:tertiary(False));
say @list.collate;                  # (a Ä ä o ö ø)

The default sort considers any character with diacritics to be larger than
ASCII characters, because that’s how they appear in the code point list. On
the other hand, collate knows that characters with diacritics belong
directly after their base character, which is not perfect in every language,
but internationally a good compromise.

For Latin-based scripts, the primary sorting criterion is alphabetic, the
secondary is diacritics, and the tertiary is case.
$*COLLATION.set(:tertiary(False)) thus makes .collate ignore case, so it
doesn’t force lower case characters to come before upper case characters
anymore.

At the time of writing, language specification of collation is not yet
implemented.

Summary

Perl 6 takes languages other than English very seriously, and goes to great
lengths to facilitate working with them and the characters they use.

This includes basing strings on grapheme clusters rather than code points,
support for non-Arabic digits in numbers, and access to large parts of the
Unicode database through built-in methods.
