Dyalog’08. Migrating to Unicode Morten Kromberg Workshop at Dyalog’08 - Elsinore.

Migrating to Unicode

Morten Kromberg

Workshop at Dyalog’08 - Elsinore

Agenda

• What is Unicode?
• V.12 Design Goals
• Key Unicode Features
• Language Differences
  – ⎕DR, ⍋ of char data
  – Space & Performance
• “Interop”: Classic vs Unicode
  – WSs & Component Files
  – TCP Sockets & Conga
  – External Vars, Mapped Files
  – Own DLLs and APs
• Native Files
  – Unicode Text Files (UTF-8)
• External Interfaces
  – COM/OLE, Microsoft.NET
  – ODBC / SQAPL
  – ⎕NA: A & W win32 calls
• Source Code Management
  – SALT, SubVersion, Diff Tools
• Planning Migrations

What is Unicode?

Wikipedia: An industry standard allowing computers to consistently represent and manipulate text expressed in any of the world's writing systems.

• It assigns a number, or code point, to each of approximately 100,000 characters
  – Including the APL character set.

• The first version of the standard appeared in 1991; support is now becoming “common” on all platforms

Why do we want Unicode?

• Obviously: it allows us to write applications which use text from all the world’s written languages…

• Less obviously, but perhaps more important in the short term:
  – APL no longer needs its own character set (“Atomic Vector”)
  – Characters no longer need to be translated on the way in and out of APL
  – APL source code can be stored in “ordinary” text files and be handled by “standard” management tools

What is Unicode in practice?

Char  Name                                HEX    DEC     UTF-8
A     Latin capital letter A              00041  65      65
Æ     Latin capital letter AE             000C6  198     195 134
α     Greek small letter alpha            003B1  945     206 177
ؤ     Arabic letter waw with hamza above  00624  1572    216 164
⍺     APL functional symbol alpha         0237A  9082    226 141 186
𠀁     CJK ideograph extension B, second   20001  131073  240 160 128 129

• Most often, when someone tells you the data “is Unicode”, they mean “UTF-8 encoded”.

Use Google...

Wikipedia too ...

Encodings

Encoding  Description
UCS-4     4 bytes per character (= Dyalog ⎕DR type 320). Often used as internal representation on Unix systems.
UCS-2     2 bytes per character (= type 160). The internal format for “wide” chars under Windows until Win2000.
UTF-8     THE most popular encoding for text files. Identical to ASCII for range 0-127 (= good for Americans). 2 bytes/char from 128-2047, 3 bytes 2048-65535, 4 bytes after that. The only encoding which is independent of “endian-ness”.
UTF-16    Identical to UCS-2 for most of first plane, but can encode all characters. Replaced UCS-2 on Windows after Win2000.

• “Unicode” assigns unique numbers to characters. Encodings are ways to represent these numbers on file.

• UCS (Universal Character Set) encodings have a fixed width; UTF (Unicode Transformation Format) encodings are variable width.
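
The UTF-8 width rule in the table can be checked from a session. This is a minimal sketch, assuming a Unicode Edition interpreter and the dyadic ⎕UCS covered under Key Unicode Features (3); the variable name cps is illustrative, and the code points are those listed in the “What is Unicode in practice?” table.

```apl
      cps←65 198 945 9082 131073            ⍝ A Æ α ⍺ 𠀁
      ∊{⍴,'UTF-8' ⎕UCS ⎕UCS,⍵}¨cps          ⍝ number of UTF-8 bytes per character
1 2 2 3 4
```

The counts follow the ranges quoted for UTF-8: one byte up to 127, two up to 2047, three up to 65535 and four beyond that.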

Version 12.0 Design Goals

• To allow users to develop Unicode applications (containing all the world’s symbols)
• To make the Dyalog IDE a Unicode application
  – No more “translate tables”!
• Avoid having to explain ⎕AV to future generations
  – Only one “kind” of characters
• Design should encourage migration
  – Controlled migration with “interop” between old & new apps
  – No “Big Bang” data conversion events
  – Classic & Unicode editions allow “parallel runs”

Unicode vs Classic

• Unicode Edition:
  – Character data is defined as Unicode code points
  – No translation of data as it moves in & out of APL

• Classic Edition:
  – Character data is defined as indices into ⎕AV
  – Translate tables used for keyboard, display and file I/O

• Classic will be available so long as a single major customer has not been able to migrate
  – The price may increase at some point

Key Unicode Features (1)

• New Character Data Types 80, 160, 320: 1-, 2-, 4-byte representations of Code Points.

      ⎕DR 'Hello'
80
      ⎕DR '{⍺+⍵}'
160
      ⎕DR '𠀁𠀁𠀁'
320

• NB: One character = one array element!

Key Unicode Features (2)

• Monadic ⎕UCS converts to and from code points (self inverse):

      ⎕UCS 'Hello'
72 101 108 108 111
      ⎕UCS '{⍺+⍵}'
123 9082 43 9077 125
      ⎕UCS (2*17)+⍳3
𠀁𠀂𠀃
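
A short session sketch of what “self inverse” means in practice; the variable name text is illustrative, and nothing here goes beyond monadic ⎕UCS as described above.

```apl
      text←'{⍺+⍵}'
      text≡⎕UCS ⎕UCS text        ⍝ characters → code points → characters
1
      ⎕UCS 1+⎕UCS 'HAL'          ⍝ ordinary arithmetic on code points
IBM
```

Because the numbers are plain Unicode code points, the same expressions give the same answers for any character that the interpreter can represent.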

Key Unicode Features (3)

• Dyadic ⎕UCS encodes and decodes data as UTF-8, UTF-16 or UTF-32:

      'UTF-8' ⎕UCS 'ABCÆØÅ'
65 66 67 195 134 195 152 195 133
      'UTF-8' ⎕UCS 240 160 128 129, 240 160 128 130, 240 160 128 131
𠀁𠀂𠀃
      'UTF-16' ⎕UCS '𠀁𠀂𠀃'
55360 56321 55360 56322 55360 56323
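
As a hedged illustration of the encode/decode pairing, the sketch below round-trips a short string through UTF-8 and then reproduces, by hand, the first UTF-16 pair shown above. The names text and bytes are illustrative; 55296 and 56320 are the surrogate bases D800 and DC00 (hex) defined by the UTF-16 encoding.

```apl
      text←'plus←{⍺+⍵}'
      bytes←'UTF-8' ⎕UCS text
      text≡'UTF-8' ⎕UCS bytes               ⍝ decoding restores the text
1
      (⍴text),⍴bytes                        ⍝ 10 characters become 16 bytes
10 16
      55296 56320+1024 1024⊤131073-65536    ⍝ surrogate pair for code point 131073
55360 56321
```

The four APL symbols in the string take three bytes each in UTF-8, which accounts for the growth from 10 to 16.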

Demo 1 ...

(key features)

Language Differences

• If you are only using APL workspaces and component files, most code from earlier versions will just load & run

• Potential problems are:
  – Monadic ⍋ (only real language difference)
  – ⎕DR to test for character data
  – Dyadic use of ⎕DR to “cast” data
  – Space usage (char arrays can be larger)

Monadic ⍋

• Due to differences in the internal representation, grade up (⍋) without a collation sequence may return different results:

Classic:
      ⍋'aA'
1 2
      ⎕AV⍳'aA'
18 66

Unicode:
      ⍋'aA'
2 1
      ⎕UCS 'aA'
97 65

• Give ⍋ a left argument of ⎕AV to maintain the current behaviour

• In many cases where monadic ⍋ is used, the order does not matter
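
A minimal sketch of the suggested fix, assuming a Unicode Edition session in which ⎕AVU still has its default value, so that ⎕AV holds the same characters in the same order as in Classic:

```apl
      ⍋'aA'              ⍝ Unicode Edition: ordered by code point, 'A' first
2 1
      ⎕AV⍋'aA'           ⍝ ⎕AV as collating sequence: the Classic order
1 2
```

The dyadic form behaves the same in both editions, so it is a reasonable choice for code that has to run in parallel under Classic and Unicode.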

Testing for Character Data

• This no longer works as expected:

      82=⎕DR X

• Dyalog recommends:

      (10|⎕DR ⍵)∊0 2

  – The latter is correct in all versions
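
The recommended test can be wrapped in a small function. The name IsChar is illustrative only; the body is exactly the expression above, which accepts types 80, 160 and 320 (remainder 0) as well as 82 (remainder 2).

```apl
      IsChar←{(10|⎕DR ⍵)∊0 2}
      IsChar 'Hello'             ⍝ type 80 in Unicode, 82 in Classic
1
      IsChar '{⍺+⍵}'             ⍝ type 160 in Unicode, 82 in Classic
1
      IsChar 1 2 3               ⍝ type 83
0
```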

Dyadic ⎕DR for “Casting”

• Classic (and previous versions):

      83 ⎕DR '⍋'         ⍝ ⎕AV[⎕IO+198]
¯109                     ⍝ Via APL+Win tables

• Unicode:

      83 ⎕DR '⍋'         ⍝ ⎕UCS 9035
75 35                    ⍝ 9035 = 256⊥⌽75 35

• The internal representation is different, and Unicode does NO TRANSLATION

• Code which (e.g.) reads characters from native files and then “casts” to number using ⎕DR needs work
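
One possible direction for that work, sketched for a Unicode Edition session: ask for code points with ⎕UCS instead of reinterpreting the internal bytes. This is an assumption about what the code wants (numeric character values), not a drop-in replacement for every cast; note that ⎕UCS returns 0-255 for one-byte characters where 83 ⎕DR returns signed values.

```apl
      256⊥⌽75 35                 ⍝ the two bytes above, low-order byte first
9035
      ⎕UCS '⍋'                   ⍝ the same number without any cast
9035
      ⎕UCS 'ABC'
65 66 67
```

When the data comes from a native file, reading it with a numeric type in the first place (for example type 83 in the ⎕NREAD call) avoids the cast altogether.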

More on ⎕DR ... (and ⎕UCS)

• Unicode Edition still recognises 82 as a left argument:

      82 ⎕DR ¯109
⍋

• This returns the same character as in Classic. But:

      ⎕DR 82 ⎕DR ¯109
160                      ⍝ Type 82 cannot exist in Unicode

• Conversely, ⎕UCS exists in Classic:

      ⎕UCS 9035
⍋
      ⎕UCS 180           ⍝ But must return elements of ⎕AV
TRANSLATION ERROR        ⍝ Cannot convert to type 82

Space and Time

• Character data will require 2 bytes per element in the Unicode Edition, if it contains APL symbols. No existing APL arrays can need 4 bytes per element.

• Primitives which manipulate or search this data may run more slowly (more data to sift through).

• Comments and character constants in code, and the script form of namespaces and classes, are also affected

Time and Space

• When copying functions between Classic and Unicode, the format needs to be converted – this can be expensive.

• The same applies when reading a ⎕OR “across the line”.

• It is not recommended to dynamically import functions across the Classic/Unicode boundary in production applications.

• Some VERY LARGE functions which could fix in v11.0 may not fix in the Unicode Edition: Lists of names and constants in a function share space with comments.
  – Proposal to relax all limits on functions may be executed for version 12.1

Unicode vs Classic

• Use the Unicode Edition if:
  – You want to develop new applications
  – You need to manage characters not in ⎕AV now.

• Use the Classic Edition if:
  – You need other v12+ enhancements, but are not ready to convert to Unicode yet
  – Classic is upwards compatible with v11.0 (as usual)

• UE and CE are maintained from a single source, and are “identical” except for character arrays.

• Start planning your migration now! (please!)

So you want to migrate soon...

• If you “only use APL” (workspaces, component files, sockets), applications SHOULD just load & run

• If you
  – Fell for the temptation to use any external tools or storage media as part of your application
  – Wrote your own APs or DLLs
  – Or want to start using data not in ⎕AV

... you may have a little work to do. Let’s take a look!

”Interop”

• Unicode and Classic editions are designed to inter-operate seamlessly – also with v11 & v10.1

• 12.0 Classic can read and translate Unicode character data found in files, workspaces and on TCP sockets

• Unicode editions will translate data to type 82 when using TCP Sockets and Component files flagged as non-Unicode (for interop with v11 & v10.1)

• If Unicode data contains characters not in ⎕AV => TRANSLATION ERROR

• Unicode editions still recognise 82 as a valid argument to ⎕DR and native file functions, and are able to map data in old native files to ”the same character”.

”Interop”

• The intention is that users should be able to perform controlled experiments when migrating to Unicode

• No ”Big Bang” data conversion events; old files and workspaces can still be read

• We hope that users will “reciprocate” by moving as quickly as possible; it is as easy as we could make it!

Workspaces

• Classic and Unicode editions can load each other’s workspaces, but:
  – Classic cannot load (or COPY from) a workspace containing characters not in ⎕AV (TRANSLATION ERROR)

• The contents of ⎕AV are defined by ⎕AVU, a list of 256 Unicode Code Points:

      ⎕AV[97+⍳26]               ⍝ By default in v12.0, "Dyalog Alt"
ÁÂÃÇÈÊËÌÍÎÏÐÒÓÔÕÙÚÛÝþãìðòõ
      ⎕AVU[97+⍳26]←9397+⍳26     ⍝ Underscored alphabet (sort of)
      ⎕AV[97+⍳26]               ⍝ Now we have "Dyalog Std" mapping

• When )COPYing from a pre-v12 workspace, ⎕AVU in the target namespace decides how incoming character data is translated. So code written using Alt & Std can be merged and maintain the original looks.

More on ⎕AVU

• The Dyalog Std font is still used in some older (“anglo”) applications

• Dyalog Alt is used across Western Europe

• Some countries use fonts created by local distributors:

      )copy avu Russian.⎕AVU
C:\...avu saved Fri Jun 27 10:00:52 2008
      3 50⍴65↓⎕AV
ABCDEFGHIJKLMNOPQRSTUVWXYZАБВГД⍙ЕЖЗИЙКЛМНОПРСТУФХЦ
ЧШЩЪЫЬЭЮ{€}⊣⌷¨Яабв⍨гдежзийклмнопрстуфхцч[/⌿\⍀<≤=≥>
≠∨∧-+÷×?∊⍴~↑↓⍳○*⌈⌊∇∘(⊂⊃∩∪⊥⊤|;,⍱⍲⍒⍋⍉⌽⊖⍟⌹!⍕⍎⍫⍪≡≢шщъы

• The translate table is also used when reading component files and APL data arriving on TCP Sockets

• It has namespace scope, so classes or namespaces can be defined to read data from Classic systems using different languages if necessary

Underscores Must Die!

• There is no Underscored alphabet in Unicode. Underscoring is a form of “emphasis” (like bold or italic). The underscored alphabet is the ONLY incompatibility with the rest of the world and should be phased OUT.

• The APL385 Unicode font incorrectly displays underscores for code points 9398-9423 (decimal). The positions should really display as Ⓐ to Ⓩ.

• (Don’t ask why circled alphabetics ARE in Unicode, while underscores are not – but Dyalog decided to map underscores to this range)

⎕AV: Just another variable

• In the Unicode Edition, the Atomic Vector is only used to define how to inter-operate with Classic systems. Only characters in ⎕AV can be shared. Assuming the default (Alt) setting:

      'ÁⒶ'∊⎕AV
1 0

• System variable ⎕Ⓐ (name now displays as ⎕Á) should no longer be used. It continues to exist and returns ⎕AV[97+⍳26]

Chars Allowed in Names

• The list has not been extended; the following are allowed:

      0123456789 (but not as the 1st character in a name)
      ABCDEFGHIJKLMNOPQRSTUVWXYZ_
      abcdefghijklmnopqrstuvwxyz
      ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝß
      àáâãäåæçèéêëìíîïðñòóôõöøùúûüþ
      ∆⍙
      Ⓐ to Ⓩ (the underscored alphabet)

• In a standard font, underscores display as Ⓐ to Ⓩ

• In Unicode, all of the above can now be used simultaneously (previously, the available set depended on whether the Alt or Std font was selected). Russian letters are NOT allowed.

Component File Interop

• Like workspaces, Component Files can be shared between Classic and Unicode editions.

• The same restriction applies: Classic cannot read arrays containing characters not in ⎕AV.

• Files can be marked as non-Unicode, in which case Unicode cannot write characters not in ⎕AV.
  – All “small” (32-bit) component files are non-Unicode

• For ordinary APL arrays (no ⎕ORs), the Unicode edition can share files with old versions of APL too.

File Properties

• New system function ⎕FPROPS allows you to control whether a file may contain Unicode data:

      'c:\temp\smallfile' ⎕FCREATE 32 32
      'EJSU' ⎕FPROPS 32          ⍝ Endian, Journaled, Size, Unicode
0 0 32 0
      'c:\temp\bigfile' ⎕FCREATE 64 64
      'EJSU' ⎕FPROPS 64
0 0 64 1

• Size defaults to 64 from v12.0 (new startup flag -F32/-F64)

• Small address size (32-bit) files are limited to 4Gb in size and can NOT have the Unicode bit set

• Setting Journaling on prevents sharing with v11.0 or earlier

Translation Error on Write

• Unicode edition can write to non-Unicode component files:

      '{⍺+⍵}' ⎕FAPPEND 32        ⍝ ∧/'{⍺+⍵}'∊⎕AV – fine!
      'U' 0 ⎕FPROPS 64           ⍝ Switch Unicode OFF
      '𠀁𠀁𠀁' ⎕FAPPEND 64          ⍝ Chars not in ⎕AV
TRANSLATION ERROR
      'U' 1 ⎕FPROPS 32           ⍝ Not allowed for small files
TRANSLATION ERROR

• If non-Unicode files do not contain namespaces or ⎕ORs, v10.1 and v11.0 can use them

• Note: Large files (64-bit) cannot be used with versions 10.0 or earlier.

TCP Socket / Conga Interop

• TCPSocket objects have an Encoding property:

Encoding  Style  Meaning
None      Char   No translation, characters must be in range 0-255.
UTF-8     Char   To UTF-8 on send, from UTF-8 on receive
Classic   APL    Chars transmitted encoded as elements of ⎕AV
Unicode   APL    Types 80, 160 or 320 used as required

• The default is None for Char, and Classic for APL

• APL sockets are non-Unicode by default to avoid crashing down-version APL interpreters receiving Unicode data

• Conga always sends data in “native” form; receive will fail with a TRANSLATION ERROR if data cannot be represented

External Variables

• External Variables are implemented as small span component files (32-bit files) – and can thus NOT contain Unicode data:

      'c:\temp\xvar' ⎕XT 'x'
      x
Hello World
      x←'𠀁𠀁𠀁'
TRANSLATION ERROR

• External Variables should be seen as a ”deprecated” feature: You will still be able to use existing external variables, but should plan to convert to component files or mapped files at your convenience.

Mapped Files

• Like external variables, the use of APL mapped files (containing APL arrays with header information) should be seen as a deprecated feature.

– Convert to using other mechanisms at your earliest convenience.

• Support for RAW mapped files (where type information is provided when mapping) remains core functionality (and will probably get more important in a world of multicore machines):

      32↓102↑80 ¯1 ⎕MAP'c:\Program Files\ComfortKeyboard\changes.txt'
Added new interface languages: Latvian, Brazilian Portuguese, Italian.

• Type 82 is NOT supported in the Unicode Edition: Mapped variables are “in the workspace” and cannot be translated on access.

• To read a raw file written using data type 82, map it with data type 83 and extract the characters by indexing into ⎕AVU.

(Own) DLLs and APs

• The format for passing APL arrays to Libraries and Auxiliary Processors is unchanged, except that a Unicode Edition will pass character arrays of type 80, 160 or 320

• Dyalog-provided libraries have been upgraded. A number of old APs like PREFECT are no longer shipped, but v11 versions will continue to work fine with the Classic Edition.

• If you have written your own APs or DLLs which handle character data, these need to be updated to deal with the new data types.

• You can return any of the Classic or Unicode character types; they will be translated (subject to the usual TRANSLATION ERROR limitations).

Native Files

• Unicode Edition also still supports type 82, so that old files containing APL characters can be used. They map to the “same characters”, but with a different internal representation:

V11:
      'c:\temp\plus' ⎕NCREATE ¯1
      '{⍺+⍵}' ⎕NAPPEND ¯1
V12:
      ⎕DR ⎕←⎕NREAD ¯1 82 5 0
{⍺+⍵}
160

Native Files & Unicode

• Unicode Edition supports new data types 80, 160, 320 – reading or writing 1, 2 or 4 bytes at a time (file is UCS-1, -2 or -4 encoded).

• Code Change Possibly Required: The DEFAULT TYPE when appending character arrays is now 80 (was 82):

      'plus:' ⎕NAPPEND ¯2        ⍝ Type 80 (all ANSI)
      '{⍺+⍵}' ⎕NAPPEND ¯1        ⍝ Type 160 (APL chars)
DOMAIN ERROR                    ⍝ Data cannot be narrowed

• Early Beta versions of 12.0 used the type of the left argument, but this led to variable numbers of bytes being used when writing, depending on the content of an array (160 if a non-ANSI character was included).

• If you need to write text containing APL to a native file, use type 160 – or perhaps better, use UTF-8!
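
A minimal sketch of the type 160 route, assuming a Unicode Edition session. The file name and the variable tn are hypothetical; passing 0 as the tie number asks the interpreter to choose a free one.

```apl
      tn←'c:\temp\plus160.txt' ⎕NCREATE 0
      '{⍺+⍵}' ⎕NAPPEND tn 160        ⍝ explicit type: 2 bytes per character
      ⎕NSIZE tn                      ⍝ 5 characters written as 10 bytes
10
      ⎕NREAD tn 160 5 0              ⍝ read 5 two-byte elements from offset 0
{⍺+⍵}
      ⎕NUNTIE tn
```

Stating the type on every ⎕NAPPEND keeps the byte width fixed regardless of what the array happens to contain, which is the problem the default of 80 was introduced to avoid.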

Native Files & UTF-8

• The most common way to store Unicode data in text files is to encode it using UTF-8: This is a format understood by “most” web applications and other Unicode-enabled applications.

      text←'plus←{⍺+⍵}'
      'UTF-8' ⎕UCS 'plus'
112 108 117 115
      'c:\temp\plus.txt' ⎕NCREATE ¯1
      (⎕UCS 'UTF-8' ⎕UCS text) ⎕NAPPEND ¯1
      ⎕CMD 'notepad c:\temp\plus.txt' 'normal'

• Windows Notepad is able to detect that the file is UTF-8 encoded and displays the text correctly.

• The monadic ⎕UCS on the left converts integers in the range 0-255 into one-byte Unicode characters before appending. If the integers were appended directly, values above 127 would become type 163 (2 bytes per element).

Native Files & UTF-8

• The most common way to store Unicode data in text files is to encode it using UTF-8: This is a format understood by ”most” web applications and other Unicode-enabled applications.

• UCS-2 (2 bytes per character) is supported by many Microsoft apps (like Visual Studio). UCS-2 was the standard until Windows 2000 – now replaced by UTF-16, which is identical to UCS-2 for most data, but expands to 4 bytes when required.

• Applications need to know which encoding has been used. Two common methods of indicating this are ”Byte Order Marks” at the beginning of the file, and (for web pages) HTTP tags.

Byte Order Mark

• By convention, the first few bytes of text files are sometimes (but not always) an encoding of U+FEFF, the “Byte Order Mark”, also known as “Zero width no-break space”.

• This convention allows applications to “guess” the encoding used:

1st bytes are...  Encoding is therefore probably
EF BB BF          UTF-8
FF FE             UTF-16 or UCS-2, written by little endian CPU (Intel)
FE FF             UTF-16 or UCS-2, big endian
FF FE 00 00       UTF-32 / UCS-4, little endian
00 00 FE FF       UTF-32 / UCS-4, big endian

• The convention is more common under Windows than Unix/Linux. Sometimes writing the BOM makes things worse...
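
For completeness, a sketch of writing a UTF-8 file that starts with the Byte Order Mark. WriteUTF8 and its argument names are illustrative and not part of Dyalog; the function assumes a Unicode Edition session and a file that does not yet exist (⎕NCREATE signals an error otherwise).

```apl
∇ text WriteUTF8 name;tn;bytes
  ⍝ Write character vector text to a new file as UTF-8, preceded by the BOM
  tn←name ⎕NCREATE 0                     ⍝ 0 = use the next free tie number
  bytes←239 187 191,'UTF-8' ⎕UCS text    ⍝ EF BB BF followed by the encoded text
  (⎕UCS bytes) ⎕NAPPEND tn 80            ⍝ one byte per element
  ⎕NUNTIE tn
∇
```

It could be called as 'plus←{⍺+⍵}' WriteUTF8 'c:\temp\plus.txt'. Dropping the 239 187 191 prefix gives a BOM-less file, which may be the better choice for tools that do not expect the mark.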

Reading Text Files

∇ Chars←ReadFile name;nid;signature;nums
  ⍝ Read ANSI or Unicode character file (Windows)
  nid←name ⎕NTIE 0
  signature←3↑⎕NREAD nid 83 3 0
  :If signature≡¯17 ¯69 ¯65                  ⍝ UTF-8 (EF BB BF)
      Chars←⎕NREAD nid 80(¯3+⎕NSIZE nid)3
      Chars←'UTF-8' ⎕UCS ⎕UCS Chars
  :ElseIf (2↑signature)≡¯1 ¯2                ⍝ Little-endian UTF-16 (FF FE)
      Chars←⎕NREAD nid 160(¯1+⎕NSIZE nid)2
  :Else                                      ⍝ ANSI
      Chars←⎕NREAD nid 80(⎕NSIZE nid)0
  :EndIf
  ⎕NUNTIE nid
∇

Writing Text Files

Writing a UTF-8 Web Page

      html←'<html>',NL,' <head>',NL
      html,←' <meta http-equiv="content-type" content="text/html; charset=UTF-8" />'
      html,←' </head>',NL,'<body>',NL
      html,←' <font face="APL385 Unicode">'
      html,←'plus←{⍺+⍵}</font>',NL
      html,←'</body>',NL,'</html>',NL

      'c:\temp\plus.htm' ⎕NCREATE ¯1
      (⎕UCS 'UTF-8' ⎕UCS html) ⎕NAPPEND ¯1
      ⎕NUNTIE ¯1

      ⎕CMD 'iexplore c:\temp\plus.htm' ''

Web Page: Results

UTF-8 Files with .NET

      apltxt←⎕SE.SALT.New 'C:\..\UTF8File' 'c:\temp\apl.txt'

      apltxt.Text
 Compute average in APL:  avg←{(+/⍵)÷⍴⍵}
      apltxt.Text,←⊂'⍝ Morten was here'

      System.Text.Encoding.⎕NL ¯2
ASCII BigEndianUnicode Default Unicode UTF32 UTF7 UTF8

External Interfaces: COM/.NET

• COM/OLE, Microsoft.Net: No problem
  – Have been translating chars to UCS-2/UTF-16 “always”
  – Translation code removed in v12 Unicode

• We already saw it in action:

      ↑System.IO.File.ReadAllLines ⊂'c:\temp\apl.txt'
Compute average in APL:
avg←{(+/⍵)÷⍴⍵}

SQAPL / ODBC & Unicode

      SQA.Connect 'B' 'MS SQL Server' 'pass' 'user'

(not all results displayed in the following)

      SQA.Columns 'B' 'idioms'
0  COLUMN_NAME .. DATA_TYPE TYPE_NAME    COLUMN_SIZE
   id          .. 4         int identity 10
   exp         .. ¯9        nvarchar     400

      ⎕←data←3 1⊃SQA.Do 'B' 'select * from idioms'
1 {(+/⍵)÷⍴⍵}
2 {⍵/⍳⍴⍵}
3 {(<\⍵)⍳1}
      data[;2]←{⎕UCS 'UTF-8' ⎕UCS ⍵}¨data[;2] ⍝ Make UTF8

SQAPL Example (continued)

      SQA.Do 'B' 'alter table idioms add utf8exp varbinary(100)'
      SQA.Prepare 'B.U1' 'update idioms set utf8exp=:<X20: where id=:<I:' ('Bulk' 20)
      SQA.X 'B.U1' (⌽data) ⍝ Store UTF8

      ⎕←data←3 1⊃SQA.Do 'B' 'select id,exp,utf8exp from idioms'
1 {(+/⍵)÷⍴⍵} {(+/âµ)÷â´âµ}
2 {⍵/⍳⍴⍵}    {âµ/â³â´âµ}
3 {(<\⍵)⍳1}  {(<\âµ)â³1}
      data[;2]≡¨{'UTF-8' ⎕UCS (⎕UCS ⍵)~0}¨data[;3] ⍝ It works!
1 1 1

ODBC / SQAPL Summary

• SQAPL 6.0 supports ODBC Unicode data types:

ODBC Type     SQAPL Type  Description
WCHAR         U           “Wide” fixed-length string
WVARCHAR      W           “Wide” variable-length
WLONGVARCHAR  Q           “Wide” unlimited-length

• These can be used in the same way as the single-byte types. In most cases, the choice is automatic (as we have seen).

• Note: The above applies to databases which have Unicode data types. However, Unicode data is often stored in single-byte types, UTF-8 encoded.

• Most of the work will be understanding how to store Unicode in your database – and converting the data (see your Database Manual).

External Interfaces: ⎕NA

• In Classic & previous editions, parameter type C meant untranslated bytes and T meant ”text”, translated to ANSI.

• In Unicode, both are untranslated.

• T without a width specification now means “wide characters according to the host convention”

• Thus: T means T1 in Classic, T2 in Unicode for Windows, and T4 under Unicode for Unix/Linux

• This means that the use of type T (<0T, >0T, =T) should be portable across Classic/Unicode systems

• Some (typically Unix/Linux) system calls expect data to be UTF-8 encoded: You must use dyadic ⎕UCS to do the translation.

• Future extensions to ⎕NA may provide UTF-8 encoding.

Selection of A or W Functions

• Under Windows, Win32 library calls which handle text are generally available in two variants:
  – An ANSI (narrow) version with a name ending in A
  – A Unicode (wide) version with a name ending in W

• For example, the function to display a message box is available as MessageBoxA and MessageBoxW.

• If you specify the character * at the end of a name, this will be replaced by A in Classic and W in the Unicode Edition.

• The intention is to allow you to write code which will work now under Classic and continue to work under Unicode – to facilitate parallel code testing and a controlled migration.

Portable ⎕NA Example

• The following function is portable between Classic and Unicode:

∇ ok←title MsgBox msg;MessageBox
  ⎕NA 'I user32|MessageBox* I <0T <0T I'
  ok←1=MessageBox 0 msg title 1     ⍝ 1=OK, 2=Cancel.
∇

• The function MessageBoxA will be selected by Classic, MessageBoxW by Unicode.

• <0T will mean 1-byte (translated) text under Classic, and 2-byte (untranslated) text under Unicode
  – Strictly speaking, text should be translated to UTF-16 in Classic, but this is only required for “a few” special chars

APL Source in Unicode Files

• SALT (Simple APL Library Toolkit) supports storage of functions, namespaces and classes in UTF-8 files with a .dyalog extension.

• You can also very easily write your own storage mechanism using Unicode text files. Under .Net it is trivial:

Save:
      System.IO.File.WriteAllText 'c:\temp\foo.txt' (⎕VR 'foo') System.Text.Encoding.UTF8

Load:
      ⎕FX System.IO.File.ReadAllText ⊂'c:\temp\foo.txt'

• Without .Net it requires a wee bit more work (as we have seen earlier)

Source Code Management

• Storing APL source in Unicode text files may seem less convenient to the seasoned APL programmer, but there are very significant advantages:

• High quality tools (both free and ”commercial”) built for other languages can be used to edit, compare, manage source, and build systems – without further ado

• Not only does this make it easier to position APL as a tool for ”professional” software development, many of these tools are actually useful (there are some smart people ”out there”)

• Young developers joining your APL team will already be familiar with these tools and feel ”at home” more quickly

• The quality of life of the APL developer need not be sacrificed!

Demo of Source Code Mgt

Source Code Mgt Demo

• All tools shown here were downloaded from the internet; none of them knew about APL in any way.

Demo: Working with MyApp

Keyboarding

• Discuss IME vs new Keyboards
• Demo new Console Unix/Linux APLs

Migration Check List

• Are you migrating in order to simplify and stay current, or because you want to support “foreign” text in your application?
  – Probably, you should do the former first (or at least experiment with it), before trying the latter

• For the former, you only need to make sure that your interfaces to external systems (native files, databases etc) work the same way as before
  – You may need to add checks to prevent the inadvertent entry of Unicode characters that your external interfaces cannot handle

• For the latter, you need to be sure that external systems ALSO support Unicode, and how they want to exchange data with your application
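
One way such a check might look, as a sketch only: InAV is an illustrative name, and the test simply reuses the ∧/…∊⎕AV idiom from the component file example. It assumes the default (Alt) ⎕AVU and that “cannot handle” means “not in ⎕AV”, which may be too strict or too loose for a particular interface.

```apl
      InAV←{∧/(,⍵)∊⎕AV}          ⍝ 1 if every character can be shared with Classic
      InAV '{⍺+⍵}'
1
      InAV 'Ⓐ'                   ⍝ circled A is not in the default ⎕AV
0
```

Applied at the point where text enters the application, a test like this turns a later TRANSLATION ERROR into an immediate, explainable rejection.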

Think about ...

• (Dyadic) ⎕DR
• Monadic ⍋ of char data
• APL style TCP Sockets
• Interop required with earlier versions?
• External Vars
• Mapped Files
• Own DLLs and APs
• Native Files
  – Need non-⎕AV/ANSI data
  – Convert to UTF-8?
• Win32 or other system calls via ⎕NA
• Underscores(!)
• Switching to SALT / SubVersion?

Suggested Strategy

• Migrate to v12 Classic, write code which works in both Classic & Unicode.

• Wait until the entire user base has upgraded to v12.

• Move the application to the Unicode Edition.

• Suggested timeframe for a large application with many interfaces might be 2-4 years.

• Start thinking now!
