Wednesday, November 28, 2007

Survival Guide to Internationalization (I18N)

As software engineers, we typically don't bother with many of the "things you should design for" until they become a necessity. Security and internationalization both fall into that bucket. The agilists would say, "Of course, stupid, don't build it until you have a business need for it" (sorry for my poor paraphrasing of a well-thought-out paradigm, but you get the picture). The older and less unwise I get, the more correct that agilist advice sounds. Alas, when the day comes to tackle the issues the software gods (of old) admonished us about (security, I18N), it's time to jump in and rip a new hole in your domain logic.

As regards I18N and the purpose of this blog entry, I recently completed a major upgrade to a major U.S. bank's loyalty card reward program web site. It's essentially an e-commerce site with a product catalog and a checkout flow where you redeem earned loyalty points for goods/events. One of the primary requirements was to "create a Spanish version of the site" (it was previously English-only). I have some background with I18N/L10N (that is, localization) and knew some of the basics of the problem space (learn to love Unicode - a disputable assertion in some parts of East Asia -, message properties files, ISO language codes, and so on), so I dove in and learned as much as I could while still getting paid for it.

What follows is a synopsis of all the technical decisions I made to get the beast up and running, with some tangential advice, gotchas, and other merriments I discovered along the way. To give some form to this description, and to trace my approach to the problem, I'll start all the way at the back-end database and move up through the layers to the JSP/HTML rendering layer.

Database
As the database we always use is Oracle, I got some built-in features for free. The biggest is that Oracle supports UTF-8 as the default encoding for a schema; I remember a few years back that MySQL more or less only supported ASCII, though that may have changed with the 5.x versions. So, when creating the Oracle schema, you need the following settings:
  • NLS_CHARACTERSET = UTF8
  • NLS_NCHAR_CHARACTERSET = UTF8
The NCHAR character set acts as a secondary character encoding for a schema; from what I remember, though, it's largely deprecated. Luckily for me, my DBAs already set up databases and schemas to use UTF-8, so this part was easy (unfortunately, that also means I don't have a setup example to show here). To confirm the schema parameters, you can execute the following query:

select * from v$nls_parameters;

As for clients who actually want to retrieve anything meaningful from the database, the executing Java process must have the following environment variable defined:

NLS_LANG=AMERICAN_AMERICA.UTF8

Note that you can only set this as an environment variable, not as a JVM argument (I tried many, many times to get it to work with JVM args - why not, Oracle?).
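Since the setting has to live in the environment, a cheap sanity check at application startup can save you a debugging session later. Here's a minimal sketch (the class name, warning text, and the suffix check are my own; the check is deliberately rough and only recognizes the plain UTF8 character set name):

```java
// Startup sanity check for the Oracle client environment.
// NLS_LANG must be a real environment variable; -D JVM system
// properties are not consulted by Oracle's client layer, so
// System.getenv is the right place to look.
public class NlsLangCheck {
    static boolean looksUtf8(String nlsLang) {
        return nlsLang != null && nlsLang.toUpperCase().endsWith(".UTF8");
    }

    public static void main(String[] args) {
        String nlsLang = System.getenv("NLS_LANG");
        if (!looksUtf8(nlsLang)) {
            System.err.println("Warning: NLS_LANG is not a UTF8 character set: " + nlsLang);
        }
    }
}
```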

Now that the database data dictionary is in place, you'll actually want to make the data within your database retrievable in a locale-specific manner, so here are a few suggestions. If you are working with a normalized data schema, you will probably have a product table that starts off like this:

[image: basic product table definition]
Pretty straightforward. Now, one possibility for making the data locale-specific is this:

[image: localized schema - product, product_details, and locales table definitions]
We've moved the user-displayable data into the product_details table and made it locale-specific by maintaining a foreign key to the locales table, a lookup table for all of the system-defined locales (which may or may not correspond to the ISO spec - but why not?).

This is the model I went with: if I was going to continue with a normalized schema, I might as well go whole hog. Of course, there are many pros and cons to this proposal, but the main impetus for me on this project was to maintain a standard approach to database design; otherwise you get mixed styles and nothing feels right (or cohesive). Looking back, as this product service is mainly read-only, I'd probably do a bit (or a lot) of denormalization to really boost read performance - but couldn't a well-behaved, properly designed cache net me the same performance boost/load reduction? Ahh, digressions....

On the other hand, for those rugged individuals who are moving to (or starting out with) a denormalized schema: if you start with the "before" version of the table definition from above, the table below would be the "after", localized version.

[image: denormalized product table definition with added ISO locale columns]
Essentially, for every locale variant of the product you will have another row in the table. Hence, you will need to drop the uniqueness constraint on the SKU column, as well as add locale-identifying fields (the ISO_*_CODE columns).
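To make the consequence concrete: the natural key of the table becomes (SKU, language, country) rather than SKU alone. A toy in-memory sketch of that composite-key idea (the names and key format here are invented for illustration, not from the actual site):

```java
import java.util.HashMap;
import java.util.Map;

// In the denormalized table, (sku, language, country) together
// identify a row; the in-memory analogue is a composite key.
public class Catalog {
    static String key(String sku, String lang, String country) {
        return sku + "|" + lang + "|" + country;
    }

    public static void main(String[] args) {
        Map<String, String> productNames = new HashMap<>();
        // Same SKU, one row per locale variant.
        productNames.put(key("ABC123", "en", "US"), "Toaster");
        productNames.put(key("ABC123", "es", "US"), "Tostadora");
        System.out.println(productNames.get(key("ABC123", "es", "US"))); // prints "Tostadora"
    }
}
```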

Streams and Strings and character arrays, oh my...

Here's a handy bit of advice: when using any kind of java.io class (a reader/writer or input/output stream), you MUST always set a character encoding (if the API allows), and always set it to "UTF-8". For example, when you need the byte data from a String (to write to an output stream, say), always use the overload of String.getBytes() that takes an encoding parameter:

byte[] b = myString.getBytes("UTF-8");

If you do not do this, according to the JavaDocs, the JVM will use the 'platform default' encoding for encoding/decoding character data. Most likely, this will be ASCII or, at best, ISO-8859-1 (Latin-1). Hence, all your well-laid plans for I18N get shot to hell if even one subcomponent fails to do this. The biggest problem, of course, is then tracking down the source of the corruption.
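The same rule applies when wrapping streams in readers and writers. A small round-trip sketch:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;

public class Utf8RoundTrip {
    public static void main(String[] args) throws Exception {
        String original = "Tarjeta de crédito";
        byte[] utf8 = original.getBytes("UTF-8");

        // Always name the charset when wrapping a stream in a Reader;
        // new InputStreamReader(in) silently uses the platform default.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(utf8), "UTF-8"));
        String decoded = reader.readLine();
        System.out.println(decoded.equals(original)); // true
    }
}
```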

Another thing: don't trust log files or standard output, as those have their own streams that are most likely not set up to help you out when debugging encoding issues (I'm looking at you, log4j). You can, of course, write a straight binary dump of a String or XML document to a flat file on disk and examine it with a hex editor (and yeah, I've been there - ouch!). Unfortunately, sometimes this is the only way to absolutely ensure that data is correctly UTF-8 encoded.
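When you do resort to byte-level inspection, you don't even need an external hex editor; a few lines of Java will show whether a string was encoded as UTF-8 or Latin-1 (the helper below is my own):

```java
public class EncodingDump {
    // Render bytes as space-separated lowercase hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff)); // mask to avoid sign extension
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // 'é' is two bytes in UTF-8 but one byte in ISO-8859-1,
        // so the dumps differ visibly.
        System.out.println(toHex("é".getBytes("UTF-8")));      // c3 a9
        System.out.println(toHex("é".getBytes("ISO-8859-1"))); // e9
    }
}
```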

Service

As our enterprise moved to an SOA style of distributed services, the product service stands between the client web application and the database. Luckily, my web service environment, Axis2, already handles all streams as UTF-8 encoded. What was interesting, however, was deciding whether or not to expose the increased complexity in the database to the domain objects and, ultimately, the service interface. To spare you a rather lengthy and not altogether thrilling discussion, I chose to isolate the complexity in the database (and the Hibernate mappings/DAO layer) and keep the domain objects and service interface largely untouched. The client essentially just indicates whether it wants a locale-specific version of a given product, and if that product/localization combination exists, the "limited-scope view" of the product is returned. There are several secondary and edge cases to worry about, but this is already getting to be a rather lengthy entry.
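For the curious, the resulting contract ended up with roughly this shape (the names below are paraphrased for illustration, not the actual production API); the locale awareness is just one extra parameter, with all the mapping complexity hidden behind the implementation:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Paraphrased sketch of the service contract: the caller asks for a
// product in a given locale; the DAO/Hibernate plumbing stays hidden.
interface ProductService {
    /** Returns the localized view of the product, or null if no
     *  localization exists for that product/locale combination. */
    Product findProduct(String sku, Locale locale);
}

class Product {
    final String sku;
    final String name;
    Product(String sku, String name) { this.sku = sku; this.name = name; }
}

// Toy in-memory implementation, just to show the contract in action.
class InMemoryProductService implements ProductService {
    private final Map<String, Product> bySkuAndLocale = new HashMap<>();

    void add(String sku, Locale locale, String name) {
        bySkuAndLocale.put(sku + "/" + locale, new Product(sku, name));
    }

    public Product findProduct(String sku, Locale locale) {
        return bySkuAndLocale.get(sku + "/" + locale);
    }
}

public class ProductServiceDemo {
    public static void main(String[] args) {
        InMemoryProductService svc = new InMemoryProductService();
        svc.add("TOASTER-1", new Locale("en", "US"), "Toaster");
        svc.add("TOASTER-1", new Locale("es", "US"), "Tostadora");
        System.out.println(svc.findProduct("TOASTER-1", new Locale("es", "US")).name);
    }
}
```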

Web application

The closer we get to the top, luckily, the easier the going gets, as all the hard lessons percolate upward quite nicely. The biggest thing to be aware of in the web app is to ensure that all incoming data from clients, whether form data or query parameters, is read from the stream as UTF-8 characters (in Tomcat, the related connector setting is known as URIEncoding). I did this rather cheaply in a servlet filter, the interesting part of which is:

httpServletRequest.setCharacterEncoding("UTF-8");


JSPs/HTML pages

The client/browser needs to be made aware of the content type coming at it. This can be set in several ways:
  • as an HTTP response header: Content-Type: text/html; charset=UTF-8 (substitute your own MIME type)
  • in a flat HTML document, within the head/meta element:
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    </head>
    ....
    </html>
  • lastly, in a JSP:
    <%@page pageEncoding="UTF-8" contentType="text/html; charset=UTF-8"%>

Ajax call

Alright, now that we've come all this way, we actually want to render a product on the page. Surprise, surprise, the product data is not returned as part of the normal page load, but instead via an Ajax call. We, as an e-commerce team, were, and to a degree still are, in the process of reevaluating how the front-end code (that is, what the JavaScript and HTML cats work with) interacts with objects coming from a service (products, campaigns, and so on). So, in the spirit of copying what all the cool kids are doing, we decided to call a servlet (an extension of our product service, although appendage may be a more apropos label) to load the product data, via shared domain logic from the service proper, and return it as a JSON string.

Admittedly, this JSON servlet is kind of a bastardized REST endpoint (meaning it's not really REST, but I'll lie to myself anyway). I had initially gone down the road of creating a fully RESTful web service à la Resource-Oriented Architecture (ROA), but then two things got in my way. The first was technical: the need to pass multiple metadata points in addition to the main resource URI (which, for space considerations, I'll cover in a separate blog entry). The second was organizational: my fellow architects at my company - one in particular - were quite opposed to the notion of ROA; he and I sparred for a few days, then I gave in because I had a project to finish. That said, I still incorporated many of the ideals, if not the actual practices, into my bastardized REST service.

A couple of choices I made about the design of the "REST" service are:
  • input
    • put a version tag in the URI itself
    • added the locale information as HTTP params (see the companion blog entry about my conflicts on this issue)
  • output
    • once again, make sure you write a UTF-8 encoded stream
    • set the Content-Type HTTP response header
    • setting the Content-Length HTTP header works great for debugging
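One subtlety with that last header: Content-Length counts bytes, not characters, and with UTF-8 the two diverge as soon as any non-ASCII character shows up. A quick demonstration:

```java
public class ContentLengthDemo {
    public static void main(String[] args) throws Exception {
        String json = "{\"name\":\"Tarjeta de crédito\"}";

        // The header value must be the encoded byte count, not
        // String.length(), or clients may truncate the response body.
        byte[] body = json.getBytes("UTF-8");
        System.out.println("characters: " + json.length());
        System.out.println("bytes:      " + body.length); // one more, thanks to 'é'
    }
}
```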
Also, when testing RESTful endpoints, the Unix tool curl was insanely helpful for sanity checks (in any dev, QA, or prod environment), as it's hard as hell to debug the output of a servlet when you are setting headers and response codes - ya know, all those REST-type things we should be using now that we've all rediscovered the Joy of HTTP. You could use Firebug in Firefox, but that's just too damned easy.

Whew, that's a lot of information, and if you've made it this far, it's time for a (strong) drink. This entry is really the culmination of about three months of work, though many of the ideas had been fermenting for a long time. With a little reflection and hindsight, there are things I wish I had done better or more efficiently, but I feel rather proud of the project. Let me know if any of this was helpful, a complete waste of time, or if you've got some ideas you'd like to share!

As a footnote, one reference that was insanely helpful was this article on the Sun Java web site.
