TextPipe: Online Help
    Unicode maps
 


 

A Unicode map allows 2-byte (UTF-16) or 4-byte (UTF-32) Unicode characters to be mapped to a sequence of zero or more output characters, in the same way that single-byte characters are mapped by an ANSI map. The primary difference between ANSI maps and Unicode maps is that a Unicode map specifies only the characters that are changed. Non-specified characters are controlled by a separate setting (otherwise you would have an unwieldy list of 65,536 to about 1 million entries to edit and work with).

The Unicode map expects characters to be in big-endian order, i.e. most significant byte (MSB) first, so the Unicode code point U+00FA must appear in the file as the two consecutive bytes 0x00 0xFA, NOT 0xFA 0x00. If the Unicode map does not work for you, check your input file format using the Convert\Hex Dump filter and, if necessary, use the Swap UTF-16 word order or Swap UTF-32 word order filter.
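To illustrate the byte order described above outside TextPipe, the following minimal Python sketch encodes U+00FA in both byte orders and swaps the bytes of each 2-byte unit. The helper name swap_utf16_word_order is hypothetical; it only mimics the effect of a word-order swap and is not TextPipe's implementation.

```python
def swap_utf16_word_order(data: bytes) -> bytes:
    """Swap the two bytes of every 2-byte unit (hypothetical helper)."""
    if len(data) % 2 != 0:
        raise ValueError("UTF-16 data must contain an even number of bytes")
    swapped = bytearray(data)
    swapped[0::2] = data[1::2]  # second byte of each pair moves to the front
    swapped[1::2] = data[0::2]  # first byte of each pair moves to the back
    return bytes(swapped)


big_endian = "\u00fa".encode("utf-16-be")     # MSB first
little_endian = "\u00fa".encode("utf-16-le")  # LSB first

print(big_endian.hex(" "))     # 00 fa
print(little_endian.hex(" "))  # fa 00
print(swap_utf16_word_order(little_endian) == big_endian)  # True
```

If a hex dump of your input shows the second pattern, the file is little endian and needs its word order swapped before the Unicode map will match anything.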

The Unicode map filter displays a grid with Start Range (Hex), End Range (Hex) and output string columns.

Type a value in the output string column to define how a Unicode character gets remapped. Click the up and down buttons in the Start Range (Hex) or End Range (Hex) columns or type a new number to change the range of Unicode characters being remapped. New values can be added on the last row, or using the Populate Values panel.

Delete Selected Rows

Deletes the currently selected rows from the grid. Character ranges not defined in the grid are controlled by the Non-entered characters panel.
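As a rough mental model of the grid, each row can be thought of as a start code point, an end code point and an output string. The Python sketch below is an illustration only, not TextPipe's implementation: the row layout, the sample rows, the lookup_mapping name, the inclusive range ends and the first-match rule are all assumptions.

```python
from typing import List, Optional, Tuple

# Each row: (start code point, end code point, output string).
# These sample rows are hypothetical.
Row = Tuple[int, int, str]

sample_rows: List[Row] = [
    (0x00E0, 0x00E5, "a"),  # accented lowercase a variants -> "a"
    (0x2018, 0x2019, "'"),  # curly single quotes -> straight quote
    (0x200B, 0x200B, ""),   # zero-width space -> nothing is output
]


def lookup_mapping(rows: List[Row], code_point: int) -> Optional[str]:
    """Return the output string of the first row containing code_point.

    None means the code point is not in any row, so the separate
    non-entered characters setting would decide what happens to it.
    """
    for start, end, output in rows:
        if start <= code_point <= end:
            return output
    return None


print(repr(lookup_mapping(sample_rows, 0x00E1)))  # 'a'
print(repr(lookup_mapping(sample_rows, 0x200B)))  # ''
print(repr(lookup_mapping(sample_rows, 0x0041)))  # None
```

Note the difference between an empty output string (the character is in the grid and produces no output) and None (the character is not in the grid at all).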

Open Map File

Opens a Unicode map file, replacing the existing map grid (the Non-entered characters panel values are not affected). Maps can be loaded from Excel worksheets (.XLS), comma-separated value files (.CSV, the default when the file extension is not recognized) and tab-delimited value files (.TAB).

Save Map to File

Saves the current Unicode map grid to a file (the Non-entered characters panel values are not saved). Maps can be saved to Excel worksheets (.XLS), comma-separated value files (.CSV, the default) or tab-delimited value files (only when the file extension is .TAB). Values are saved exactly as they are shown on screen.

UTF Mode

This drop-down list specifies whether each input character is 2 bytes (UTF-16) or 4 bytes (UTF-32).
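The effect of the two modes can be pictured as cutting the input into fixed-width, big-endian units. The Python sketch below is a minimal illustration under that assumption; split_code_units is a hypothetical name, and how TextPipe itself treats UTF-16 surrogate pairs is not stated in this topic (in this simple model they would appear as two separate 2-byte units).

```python
import struct
from typing import List


def split_code_units(data: bytes, unit_size: int) -> List[int]:
    """Split big-endian data into 2-byte or 4-byte integer units."""
    if unit_size not in (2, 4):
        raise ValueError("unit_size must be 2 (UTF-16) or 4 (UTF-32)")
    if len(data) % unit_size != 0:
        raise ValueError("data length is not a multiple of the unit size")
    count = len(data) // unit_size
    # ">" = big endian, "H" = unsigned 16-bit, "I" = unsigned 32-bit
    fmt = ">%d%s" % (count, "H" if unit_size == 2 else "I")
    return list(struct.unpack(fmt, data))


utf16_units = split_code_units("A\u00fa".encode("utf-16-be"), 2)
utf32_units = split_code_units("A\u00fa".encode("utf-32-be"), 4)

print([hex(u) for u in utf16_units])  # ['0x41', '0xfa']
print([hex(u) for u in utf32_units])  # ['0x41', '0xfa']
```

Choosing the wrong mode makes the same bytes split at the wrong boundaries, so none of the grid ranges would match the intended characters.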

Non-entered characters

This panel controls what happens to characters not found in any range in the grid.

  • Pass through unchanged - characters not found in the grid are passed through unchanged
  • Pass to sub filter - characters not found in the grid are passed to the sub filter. Typically the sub filter is a Script filter - see below for a sample script.
  • Replace with value - characters not found in the grid are replaced with the set value. This field supports special characters. Use %x (or %4.4x for better formatting) in the replacement value as a placeholder for the hex value of the character, and %d as a placeholder for the decimal value of the character, e.g.

    Invalid character found 0x%4.4x

    It's also common to set the Replace with value to

    &#%d;

    to convert it to a Unicode entity expressed in decimal, or

    &#x%4.4x;

    to a Unicode entity expressed in hex, or

    \U+%4.4x

    to a Unicode character value.
     
  • Remove - characters not found in the grid are removed
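The %x, %4.4x and %d placeholders follow the familiar printf convention, which Python's % operator also implements. The sketch below shows what such templates expand to for U+00FA; it assumes TextPipe's expansion matches printf behavior, and details such as the letter case of the hex digits may differ in TextPipe. The expand_replacement name is hypothetical.

```python
def expand_replacement(template: str, code_point: int) -> str:
    """Fill a single %x, %4.4x or %d placeholder with the code point."""
    return template % code_point


code_point = 0x00FA  # decimal 250

print(expand_replacement("Invalid character found 0x%4.4x", code_point))
# Invalid character found 0x00fa
print(expand_replacement("&#%d;", code_point))
# &#250;
print(expand_replacement("0x%x", code_point))
# 0xfa
```

The %4.4x form pads the hex value with leading zeros to at least four digits, which is why it gives tidier output than plain %x for low code points.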

Populate values

The Populate values group makes it easy to set large groups of values. First enter a starting value and an ending value (in decimal), then click the button corresponding to your choice:

[Clear] Add the specified characters to the grid and set their mapping so that nothing is output when they are encountered.

[Default] Add the specified characters to the grid, and set their mapping so that they are passed through unchanged.
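Because the Populate values fields take decimal numbers while the grid columns are labelled in hex, it can help to convert between the two when checking the result. The Python sketch below is only a convenience illustration; to_grid_hex is a hypothetical name, and the four-digit uppercase format is an assumption about how the grid displays values.

```python
from typing import Tuple


def to_grid_hex(start_decimal: int, end_decimal: int) -> Tuple[str, str]:
    """Convert a decimal start/end pair to zero-padded hex strings."""
    if start_decimal < 0 or end_decimal < start_decimal:
        raise ValueError("expected 0 <= start <= end")
    return format(start_decimal, "04X"), format(end_decimal, "04X")


# Decimal 8192 to 8303 corresponds to the hex range 2000 to 206F
print(to_grid_hex(8192, 8303))  # ('2000', '206F')

# Going the other way: hex values read from the grid back to decimal
print(int("2000", 16), int("206F", 16))  # 8192 8303
```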

Sample sub filter VBScript

'Log unknown Unicode values to the TextPipe error log
dim a           'numeric value of the current code point
dim errorCount  'number of unknown code points seen in this job
const maxErrors = 20

function processLine(line, EOL)
  'this sample only handles the 2-byte (UTF-16) case
  if len(line & EOL) = 2 then
    'line & EOL is the two characters of the Unicode code point
    errorCount = errorCount + 1
    if errorCount < maxErrors then
      'big endian: the first character is the high byte
      a = Asc(mid(line & EOL,1,1)) * 256 + Asc(mid(line & EOL,2,1))
      TextPipe.logError "Invalid code point 0x" & hex( a )
      'msgBox "Invalid code point 0x" & hex( a )
    elseif errorCount = maxErrors then
      'log the limit once, then stay silent for the rest of the job
      TextPipe.logError "Maximum code point Errors reached: " & errorCount
    end if
  end if

  'return nothing - invalid code point is absorbed
  processLine = ""
end function


sub startJob()
  'reset the counter at the start of each job
  errorCount = 0
end sub


sub endJob()
end sub


function startFile()
  startFile = ""
end function


function endFile()
  endFile = ""
end function

See also

Refresh map list
New map
Base map on filter list
Multi-maps
Available maps
Swap UTF-16 word order
Swap UTF-32 word order
Make Big Endian
Make Little Endian
Unicode conversion

 

 

 Copyright © 1999-2006 DataMystic. All rights reserved.