How It All Started
Existing Sybase database and application
Needed to convert to Oracle
Use Oracle Migration Workbench
OMWB works well
I wasn’t told there was multi-byte data in the Sybase database
After the migration to Oracle
Kanji data were missing
Kanji Become Lost
How was Kanji stored in Sybase?
How was application working with Sybase?
Why lost when migrated to Oracle?
How to fix in Oracle?
Would application work with Oracle?
Before Upgrade to Oracle
Moving Sybase Data to Oracle
What happened?
Oracle database character set is US7ASCII
7 bits per character
Import stripped the 8th bit off each byte
8th bit set to 0
8-bit characters are now 7-bit characters
Original character data is lost
8-bit characters can’t be represented in the US7ASCII character set
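The stripping is a simple bit mask; a minimal SQL sketch of what happened to each byte (176 is 0xB0, one of the Kanji bytes that appears throughout this story):

    -- 0xB0 (176) with the 8th bit cleared becomes 0x30 (48), the character '0'
    SELECT BITAND(176, 127) AS stripped, CHR(BITAND(176, 127)) AS as_char FROM dual;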
US7ASCII to WE8ISO8859P1
Fix -- Sybase Data to Oracle
Current Oracle Production
Existing Application
How does it store/retrieve Kanji?
Multi-byte Kanji characters
Stored in WE8ISO8859P1 (single-byte) db
Application
JDBC retrieves bytes from WE Oracle db
Java generates HTML, sent to client browser
Netscape, view HTML using “Japanese (Auto-Detect)” character set
Display Kanji
Existing Application
Each piece of software makes some decision (default) about character set
You need to understand this process for your application
What Really Happened?
Source Kanji data
From EUC-JP character set
Multi-byte
Kanji multi-byte stored in Sybase db
Default character set iso_1 (ISO 8859-1)
8-bit, single-byte
Kanji multi-byte stored in Oracle db
Character set WE8ISO8859P1
8-bit, single-byte
Convert to UTF8
Why?
Eliminate all the issues shown so far
Store multiple languages correctly
Correctly encoded
Support clients inserting data in languages other than Japanese Kanji
Existing application can only support languages based on Latin characters and Kanji
Conversion is Simple -- Isn’t It?
Export WE8ISO8859P1 database
Set export client NLS_LANG
AMERICAN_AMERICA.WE8ISO8859P1
Import into UTF8 database
Set import client NLS_LANG
AMERICAN_AMERICA.WE8ISO8859P1
Test application
Application works!
Is everything OK?
Meanwhile, Back at the Ranch
While application testing is going on
Insert sample bytes for Kanji into WE db
Use Oracle SQL CHR() function
Export from WE db, import into UTF8 db
Examine same bytes in UTF8 db
Compare UTF8 bytes to manually generated UTF8 bytes for the Kanji characters
NOT the same bytes!
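A sketch of that probe (kanji_test is a hypothetical scratch table):

    -- In the WE8ISO8859P1 db: insert the two EUC-JP bytes of one Kanji
    INSERT INTO kanji_test (col) VALUES (CHR(176) || CHR(161));   -- 0xB0, 0xA1
    -- After export/import, in the UTF8 db: dump what actually arrived
    SELECT DUMP(col, 16) FROM kanji_test;   -- c2,b0,c2,a1 -- not the expected e4,ba,9c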
What does this mean?
UTF8 Encoding Process
Think your life is boring?
Bytes is Bytes
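How UTF8 encoding actually works (standard Unicode, not Oracle-specific): U+0080 through U+07FF becomes 2 bytes (110xxxxx 10xxxxxx), U+0800 through U+FFFF becomes 3 bytes (1110xxxx 10xxxxxx 10xxxxxx). Worked for the two encodings that matter in this story:

    U+00B0 (byte 0xB0 read as Latin-1)  = 000 1011 0000        -> 11000010 10110000          = 0xC2 0xB0
    U+4E9C (the Kanji, EUC-JP 0xB0A1)   = 0100 1110 1001 1100  -> 11100100 10111010 10011100 = 0xE4 0xBA 0x9C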
Conversion Issue
Import to UTF8 Conversion
What Happened?
Oracle did exactly what it was told to do
Take bytes from WE database
Convert to UTF8 bytes
Export file was made from WE database
WE is single-byte character set
Convert each byte one at a time to UTF8
Kanji character consists of 2 bytes in WE db
Converting each byte to UTF8 not the same as converting the pair of bytes to UTF8
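Both conversions side by side; a sketch to run in the WE8ISO8859P1 database (DUMP with format 16 prints the bytes in hex):

    SELECT DUMP(CONVERT(CHR(176)||CHR(161), 'UTF8', 'WE8ISO8859P1'), 16) AS byte_by_byte,
           DUMP(CONVERT(CHR(176)||CHR(161), 'UTF8', 'JA16EUC'), 16)      AS as_kanji_pair
    FROM dual;
    -- byte_by_byte:   Typ=1 Len=4: c2,b0,c2,a1   (what export/import produced)
    -- as_kanji_pair:  Typ=1 Len=3: e4,ba,9c      (the correct UTF8 Kanji)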
Yeah, but the application works! (?)
Where’s the Problem?
UTF8 db has Kanji as 0xC2B0, 0xC2A1
Correct UTF8 encoding is 0xE4BA9C
If new, correctly encoded Kanji is inserted
Database contains two sets of bytes for same Kanji character
How does app deal with this?
Existing app only works using Netscape Japanese (Auto-Detect) character set
App is not really UTF8, only works for Japanese characters
How Does Application Work?
Review
Oracle db created using UTF8 character set
Java retrieves char data (bytes) from UTF8 db
Converts to UCS2 (Unicode)
Java code generates HTML
Client browser displays Kanji characters
Netscape, “Japanese (Auto-Detect)” char set
Application still works
even though the bytes in the UTF8 db don't represent UTF8-encoded Kanji
Application Works (?)
Test Application
Insert bytes for correctly encoded Kanji
Into UTF8 db
Use CHR() function (see the sketch below)
Display this data using existing application
Does NOT display Kanji!
Using “Japanese (Auto-Detect)” character set
Try Netscape UTF8 character set
Doesn’t display Kanji
UTF8 character set should work, shouldn’t it?
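The insert half of that test, as a sketch (kanji_test hypothetical; CHR() builds the three UTF8 bytes directly):

    -- 0xE4, 0xBA, 0x9C: the correct UTF8 encoding of the Kanji
    INSERT INTO kanji_test (col) VALUES (CHR(228) || CHR(186) || CHR(156));
    SELECT DUMP(col, 16) FROM kanji_test;   -- e4,ba,9c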
Where Are We?
Correctly encoded UTF8 multi-byte character data for Kanji does not work with existing application
Simply “converting” (export WE, import to UTF8) doesn’t result in correctly encoded UTF8 character data
Need to figure out what app code is doing
Whoever wrote it is gone
The usual state of affairs
How To Debug App Code?
Don’t use app code
Write a very simple Java servlet instead
(The Java Diva helps with this…)
Servlet simply retrieves character data from db
Runs in iPlanet web server
generates HTML for client browser
Use servlet to retrieve correct UTF8 Kanji
Does not display Kanji!
Fix the servlet, then apply the same fix to the application code?
Modified Servlet Code
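The deck's actual code isn't reproduced here; a minimal sketch of the kind of servlet involved, assuming the key change was declaring UTF-8 on the HTTP response rather than letting the web server and browser fall back to their defaults (class, connection, and query names are illustrative):

    import java.io.PrintWriter;
    import java.sql.*;
    import javax.servlet.http.*;

    public class KanjiServlet extends HttpServlet {
        public void doGet(HttpServletRequest req, HttpServletResponse res)
                throws java.io.IOException {
            // Without an explicit charset the container picks a default
            // (often ISO-8859-1) and the UTF-8 bytes are mangled en route.
            res.setContentType("text/html; charset=UTF-8");
            PrintWriter out = res.getWriter();
            try (Connection con = DriverManager.getConnection(
                     "jdbc:oracle:thin:@host:1521:db", "user", "pw");
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT col FROM kanji_test")) {
                out.println("<html><body>");
                while (rs.next()) out.println(rs.getString(1) + "<br>");
                out.println("</body></html>");
            } catch (SQLException e) {
                out.println("error: " + e.getMessage());
            }
        }
    }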
Fix Application
Make same changes to application code
Browser displays Kanji correctly
Manually generated, correctly encoded UTF8
Application interacts with Dynamo
Need to reconfigure Dynamo for UTF8 data
Application fixed (?)
Works with correctly encoded UTF8 multi-byte data
Is Application really fixed?
Fixed app retrieves correctly encoded UTF8 character data
What about existing character data?
Data that was exported from WE and imported into UTF8 db
Use fixed app code to retrieve existing data
Existing Kanji are not displayed
Original app did display existing data...
Existing data is not correctly encoded UTF8
Fixed Application
How To Fix Existing Data?
What's wrong with the existing data (UTF8 db)?
Character data is not correctly encoded UTF8
It is the UTF8 encoding of each single byte that came out of the WE database
What was the data before the import into the UTF8 database?
EUC-JP character data (Latin ASCII and Kanji)
Stored in the single-byte WE database
Need to convert "UTF8 of WE of EUC-JP" into the correct UTF8 bytes for each Kanji
Review of Bytes is Bytes
Original Kanji character 0xB0A1 (EUC-JP)
Inserted into Oracle database
0xB0, 0xA1 in WE8ISO8859P1 db
Exported/imported into Oracle UTF8 db
Individual bytes converted to UTF8; each byte >= 0x80 becomes 2 UTF8 bytes
Original Kanji character was 2 bytes
Became 4 bytes in UTF8 db
0xC2B0, 0xC2A1
Correct UTF8 bytes are 0xE4BA9C
How to Convert Existing Data?
Fix in Oracle WE before export/import
No point; export/import will 'corrupt' the character data again, so it would still need fixing afterward
Don’t export/import
SELECT each table from the WE db into flat files
SQL*Loader into UTF8 database
Use CHARACTERSET JA16EUC option
More work moving each table one at a time
SQL*Loader Option
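A sketch of the control file for that option (file, table, and column names hypothetical; the CHARACTERSET clause tells SQL*Loader the flat file is EUC-JP so it can convert correctly into the UTF8 database):

    -- load_emp.ctl
    LOAD DATA
    CHARACTERSET JA16EUC
    INFILE 'emp.dat'
    INTO TABLE emp
    FIELDS TERMINATED BY ','
    (empno, ename)

Run with sqlldr control=load_emp.ctl against the UTF8 database.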
Convert Existing Data
Fix data after import into UTF8 database
Export from WE, import into UTF8 database
Use Oracle SQL CONVERT() function
CONVERT() from UTF8 to WE8ISO8859P1
CONVERT() from JA16EUC to UTF8
Need to CONVERT() each column of each table that contains multi-byte data
How to be sure which columns to CONVERT()?
CONVERT() all columns that contain char data?
Must test using CONVERT() to verify it works
Fix After Import
Oracle CONVERT()
Syntax: CONVERT(char, dest_char_set, source_char_set)
select CONVERT(col, 'WE8ISO8859P1', 'UTF8') from tab;   -- undo the per-byte import: back to original EUC-JP bytes
select CONVERT(col, 'UTF8', 'JA16EUC') from tab;        -- treat those bytes as JA16EUC, yield correct UTF8
select CONVERT(CONVERT(col, 'WE8ISO8859P1', 'UTF8'), 'UTF8', 'JA16EUC') from tab;   -- both steps nested
Don’t re-run CONVERT() without testing
a second pass over already-converted data corrupts it
to retry, regenerate the original source data, then re-run CONVERT()
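Applied as an in-place fix, the two steps nest into one statement; a sketch with hypothetical table/column names (test on a copy first, per the warning above):

    UPDATE emp
    SET ename = CONVERT(CONVERT(ename, 'WE8ISO8859P1', 'UTF8'), 'UTF8', 'JA16EUC');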
Overall Conversion Process
What we did…
Identify tables/columns that contain multi-byte data
Export from WE database
Import into UTF8 database
rows=n, create tables, don’t load data
Widen columns for UTF8 multi-byte data
increase to 3 times original width (see the sketch after this list)
Import into UTF8 database (again)
ignore=y, load data into existing tables
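For the widening step above, a minimal sketch (table/column names hypothetical; a character can take up to 3 bytes in UTF8, hence the factor of 3):

    -- a VARCHAR2(30) column that may hold Kanji becomes VARCHAR2(90)
    ALTER TABLE emp MODIFY (ename VARCHAR2(90));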
Overall Conversion Process
Continued
CONVERT() columns that contain multi-byte data
Test; compare against output from the existing application and data
Conversion includes converting all pieces of the application, not just the Oracle database
Details - Source Char Set?
How did I determine this?
Original Kanji data was from EUC-JP
How was this determined?
Examine bytes of original character data
Display Original Kanji characters
Find single Kanji in Japanese dictionary
Gives the row-cell (kuten) code of the Kanji in JIS X 0208
Using other reference sources
manually generate bytes for the Kanji in various encodings
Compare with bytes of original Kanji data
Rosetta Stone?
Reference Books Used
Lessons Learned
Oracle (and Sybase) don’t store characters
They store bytes, strings of bytes
Normally, Oracle does NO checking of character set
does NOT check that bytes inserted represent correct characters in database character set
Only under specific circumstances does Oracle “apply” a character set to char data
Changing character set affects more than just the database
Lessons Learned
Bytes of a character from any char set can be stored in a db of any char set
EUC-JP chars in a WE db, or in a UTF8 db
bytes in the db are not the 'correct' bytes for that character in the db character set
all apps, users, and dbs must know the db contains char data from another char set
Any char set conversion may corrupt that char data -- e.g., importing WE into a UTF8 db
Lessons Learned
Simply exporting db, importing into UTF8 does not solve the problems
Testing requires generating correctly encoded character data
Every piece of an application makes a decision about character set (default)
If all data in db really is in the db char set
export, import to db of other char set works
Need to see original character data
Verify data after char set conversion