Wednesday, November 28, 2007

Survival Guide to Internationalization (I18N)

As software engineers, we typically don't bother with many "things you should design for" until they become a necessity. Things like security and internationalization fall into that bucket. The agilists would say, "Of course, stupid, don't build it until you have a business need for it" (sorry for my poor paraphrasing of a well-thought-out paradigm, but you get the picture). The older and less unwise I get, the more correct that agilist advice sounds. Alas, when the day comes to tackle the issues the software gods of old admonished us about (security, I18N), it's time to jump in and rip a new hole in your domain logic.

As regards I18N and the purpose of this blog entry, I recently completed a major upgrade for a major U.S. bank on their loyalty card reward program web site. It's essentially an e-commerce site with a product catalog and a checkout flow where you redeem earned loyalty points for goods/events. One of the primary requirements was to "create a Spanish version of the site" (it was previously English-only). I have some background with I18N/L10N (that is, localization) and knew some of the basics of the problem space (learn to love Unicode - a disputable assertion in some parts of East Asia - message properties files, ISO language codes, and so on), so I dove in and learned as much as I could while still getting paid for it.

What follows is a synopsis of all the technical decisions I made to get the beast up and running. I'll offer some tangential advice and point out some gotchas and other merriments I discovered along the way. To give some form to this description, and also to trace my approach to the problem, I'll start all the way at the back-end database and move up through the layers to the JSP/HTML rendering layer.

Database
As the database we always use is Oracle, I was able to get some built-in features for free. The biggest is that Oracle supports UTF-8 as the encoding for the entire database; I remember that a few years back MySQL only supported ASCII, more or less, but that may have changed with the 5.x versions. So, when creating the Oracle database, you need the following settings:
  • NLS_CHARACTERSET = UTF8
  • NLS_NCHAR_CHARACTERSET = UTF8
The NCHAR character set acts as a secondary character encoding for the database (it applies to the NCHAR/NVARCHAR2 column types). Luckily for me, my DBAs had already set up our databases and schemas to use UTF-8, so this part was easy (unfortunately, that also means I don't have a creation script to show here). To confirm the settings, you can execute the following query:

select * from v$nls_parameters;

As for clients who actually want to retrieve anything meaningful from the database, the executing Java process must have the following environment variable defined:

NLS_LANG=AMERICAN_AMERICA.UTF8

Note that you can only set this as an environment variable, not as a JVM argument (I tried many, many times to get it to work with JVM args - why not, Oracle?).
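So, in a Unix shell, launching the client process looks something like this (the jar name is hypothetical):

export NLS_LANG=AMERICAN_AMERICA.UTF8
java -jar rewards-webapp.jar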

Now that the database data dictionary is in place, you'll actually want to make the data within your database retrievable in a locale-specific manner, so here are a few suggestions. If you are working with a normalized schema, you will probably have a product table that starts off simple.
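Something along these lines (an illustrative sketch - the exact columns are made up, but you get the idea):

create table product (
    product_id   number primary key,
    sku          varchar2(32) not null unique,  -- unique, for now
    name         varchar2(255),                 -- user-displayable
    description  varchar2(2000),                -- user-displayable
    point_cost   number                         -- illustrative loyalty-point price
);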

Pretty straightforward. Now, one possibility for modeling locale-specific data is the following.
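Sketched in the same illustrative DDL (the product table from above slims down to just its non-displayable columns):

create table locales (
    locale_id          number primary key,
    iso_language_code  char(2) not null,  -- e.g. 'en', 'es'
    iso_country_code   char(2) not null   -- e.g. 'US'
);

create table product_details (
    product_id   number not null references product,
    locale_id    number not null references locales,
    name         varchar2(255),
    description  varchar2(2000),
    primary key (product_id, locale_id)
);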

We've moved the user-displayable data into the product_details table and made it locale-specific by maintaining a foreign key to the locales table, which is a lookup table for all of the system-defined locales (which may or may not correspond to the ISO specs - but why not?).

This is the model I went with: if I was going to continue with a normalized schema, I might as well go whole hog. Of course, there are many pros and cons to this proposal, but the main impetus for me on this project was to maintain a standard approach to database design; otherwise you get mixed styles and nothing feels right (or cohesive). Looking back, as this product service is mainly a read-only sink, I'd probably do a bit (or a lot) of denormalization to really boost read performance - but couldn't a well-behaved, properly-designed cache net me the same performance boost and load reduction? Ahh, digressions....

On the other hand, for those rugged individuals who are moving to (or starting out with) a denormalized schema: if you start with the "before" version of the table definition from above, the table below would be the "after", localized version.
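Again as an illustrative sketch:

create table product (
    product_id         number primary key,
    sku                varchar2(32) not null,  -- uniqueness constraint dropped
    iso_language_code  char(2) not null,
    iso_country_code   char(2) not null,
    name               varchar2(255),
    description        varchar2(2000),
    point_cost         number
);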

Essentially, for every locale-variant of the product you will have another row in the table. Hence, you will need to drop the uniqueness constraint on the SKU column, as well as add locale-identifying fields (the ISO_*_CODE columns).

Streams and Strings and character arrays, oh my...

Here's a handy bit of advice: when using any kind of java.io class (a reader/writer, or an input/output stream), one thing you MUST always do is set a character encoding (if the API allows), and always set it to "UTF-8". For example, when you need the byte data from a String (say, to write to an output stream), always use the overload of String.getBytes() that takes an encoding parameter:

byte[] b = myString.getBytes("UTF-8");

If you do not do this, according to the JavaDocs, the JVM will use the platform default for encoding/decoding character data. Most likely, this will be ASCII or, at best, ISO-LATIN-1. Hence, all your well-laid plans for I18N get shot to hell if even one subcomponent fails to do this. The biggest problem, of course, is tracking down the source of the corruption.
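The same rule applies when wrapping raw streams with readers and writers; for example (file names made up):

// name the encoding explicitly; never let the platform default sneak in
Reader reader = new InputStreamReader(new FileInputStream("product.xml"), "UTF-8");
Writer writer = new OutputStreamWriter(new FileOutputStream("product-copy.xml"), "UTF-8");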

Another thing: don't trust log files or standard output, as those have their own streams that are most likely not set up to help you out when debugging encoding issues (I'm looking at you, log4j). You can, of course, write a straight binary dump of a String or XML document to a flat file on your disk and examine it with a hex editor (and yeah, I've been there - ouch!). Unfortunately, sometimes this is the only way to absolutely ensure that data is correctly UTF-8 encoded.
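If you'd rather skip the hex editor, a quick-and-dirty dump of a suspect String looks like this:

byte[] bytes = suspectString.getBytes("UTF-8");
for (byte b : bytes) {
    // the hex output itself is pure ASCII, so the console encoding can't mangle it;
    // an 'é', for instance, should show up as the two bytes c3 a9
    System.out.printf("%02x ", b & 0xff);
}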

Service

As our enterprise has moved to an SOA style of distributed services, the product service stands between the client web application and the database. Luckily, my web service environment, Axis2, already handles all streams as UTF-8 encoded. What was interesting, however, was deciding whether or not to expose the increased complexity in the database to the domain objects and, ultimately, the service interface. To spare you a rather lengthy and not altogether thrilling discussion, I chose to isolate the complexity to the database (and the Hibernate mappings/DAO layer) and keep the domain objects and service interface largely untouched. The client essentially just needs to indicate whether it wants a locale-specific version of a given product, and if that product/localization combination exists, the "limited-scope view" of the product is returned. There are several second-order and edge cases to worry about, but this is already getting to be a rather lengthy entry.
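In spirit, the service interface ended up with the flavor of something like this (a sketch, not the actual interface):

public interface ProductService {
    // returns the "limited-scope view" of the product for the given locale,
    // or null if no such localization exists
    Product getProduct(String sku, String isoLanguageCode, String isoCountryCode);
}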

Web application

The closer we get to the top, luckily, the easier the going gets, as all the hard lessons percolate upward quite nicely. The biggest thing to be aware of in the web-app is ensuring that all incoming data from clients is read as UTF-8: setting the request's character encoding covers posted form data, while query parameters are typically governed by the container's connector settings (URIEncoding, if you're on Tomcat). I did this rather cheaply in a servlet filter, the interesting part of which is:

httpServletRequest.setCharacterEncoding("UTF-8");
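Fleshed out, the whole filter is barely more than that one line (a minimal sketch, assuming the usual web.xml mapping to /*):

import java.io.IOException;
import javax.servlet.*;

public class Utf8EncodingFilter implements Filter {

    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
        // this must happen before anything reads parameters from the request
        request.setCharacterEncoding("UTF-8");
        chain.doFilter(request, response);
    }

    public void init(FilterConfig config) throws ServletException { }
    public void destroy() { }
}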


JSPs/HTML pages

The client/browser needs to be made aware of the content type coming at it. This can be set in several ways:
  • as an HTTP response header: Content-Type: text/html; charset=UTF-8 (substitute your own MIME type, and note that it's "charset", not "char-set") - see the servlet one-liner after this list
  • in a flat HTML document, within the head/meta element:
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    </head>
    ....
    </html>
  • lastly, in a JSP:
    <%@page pageEncoding="UTF-8" contentType="text/html; charset=UTF-8"%>
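The response header variant, from inside a servlet, is a one-liner:

// sets the Content-Type header and the response's character encoding in one shot;
// call it before getting the response's writer
response.setContentType("text/html; charset=UTF-8");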

Ajax call

Alright, now that we've come all this way, we actually want to render a product on the page. Surprise, surprise: the product data is not returned as part of the normal page load, but instead via an Ajax call. We, as an e-commerce team, were, and to a degree still are, in the process of reevaluating how the front-end code (that is, what the JavaScript and HTML cats work with) interacts with objects coming from a service (products, campaigns, and so on). So, in the spirit of copying what all the cool kids are doing, we decided to call a servlet (an extension of our product service, although appendage may be a more apropos label) to load the product data, via domain logic shared with the service proper, and return it as a JSON string.

Admittedly, this JSON servlet is kind of a bastardized REST endpoint (meaning it's not really REST, but I'll lie to myself anyway). I had initially gone down the road of creating a fully RESTful web service a la Resource Oriented Architecture (ROA), but then two things got in my way. The first was technical: the need to pass multiple meta-data points in addition to the main resource URI (which, for space considerations, I'll cover in a separate blog entry). The second was organizational: my fellow architects at my company, one in particular, were quite opposed to the notion of ROA; he and I sparred for a few days, then I gave in because I had a project to finish. That being said, I still incorporated many of the ideals, if not the actual practices, into my bastardized REST service.

A couple of choices I made about the design of the "REST" service (a sketch of the response-writing code follows the list):
  • input
    • put a version tag in the URI itself
    • added the locale information as HTTP params (see the companion blog entry about my conflicts on this issue)
  • output
    • once again, make sure you write a UTF-8 encoded stream
    • set the Content-Type HTTP response header
    • setting the Content-Length HTTP header works great for debugging
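Pulled together, the response-writing half of the JSON servlet looks roughly like this (a sketch; the parameter names and the loadProductAsJson() helper are illustrative):

protected void doGet(HttpServletRequest request, HttpServletResponse response)
        throws IOException {
    // locale arrives as plain HTTP params (see the companion entry for the hand-wringing)
    String language = request.getParameter("language");
    String country = request.getParameter("country");

    // hypothetical helper that runs the shared domain logic and serializes to JSON
    String json = loadProductAsJson(request.getPathInfo(), language, country);

    byte[] body = json.getBytes("UTF-8");                        // UTF-8, always
    response.setContentType("application/json; charset=UTF-8");  // the Content-Type header
    response.setContentLength(body.length);                      // great for debugging
    response.getOutputStream().write(body);
}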
Also, when testing RESTful endpoints, the Unix tool curl was insanely helpful for sanity checks (in any dev, qa, or prod environment), as it's hard as hell to debug the output of a servlet when you are setting headers, setting response codes - ya know, all those REST-type things we should be using now that we've all rediscovered the Joy of HTTP. You could use Firebug in Firefox, but that's just too damned easy.
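For instance, to eyeball the headers and response code in one shot (URL hypothetical; -v prints both sides of the conversation):

curl -v "http://www.my_domain.com/item/v-1.2/123?language=es&country=US"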

Whew, that's a lot of information, and if you've made it this far, it's time for a (strong) drink. This entry is really the culmination of about three months of work, but many of the ideas had been fermenting for a long time. With a little reflection and hindsight, there are some things I wish I had done better, more efficiently, and so on, but I feel rather proud of the project. Let me know if any of this was helpful, a complete waste of time, or if you've got some ideas you'd like to share!

As a footnote, one reference that was insanely helpful was this article on the Sun Java web site.

Meta-data and RESTful web services

I'd like to think out loud about REST and URIs and ROA. The concepts behind REST are ideas I have really been chewing on lately, but there are some issues I find in reconciling them with my current software needs. Let's say I have a URL/URI:

http://www.my_domain.com/item/123

which represents the addressable location of the item I am interested in (uniquely, and cleverly, identified by the identifier "123"). What happens when my service undergoes many updates and versioning needs to be introduced? I could add a version tag into the URI like this:

http://www.my_domain.com/item/v-1.2/123

Although the question remains: is this a version of the resource, or of the mechanism via which I access that resource? Let's leave that question alone and move on to more practical matters.

Let's say I now need to indicate a return (MIME) type. The resource may be available as a JSON string or an XML document. The natural thing, or at least what we've been trained to do, is to stick a well-known extension on the end of the resource name, as in

http://www.my_domain.com/item/123.xml

and if I combine that with my versioning:

http://www.my_domain.com/item/v-1.2/123.xml

OK, this seems relatively reasonable, but what happens when I introduce another piece of meta-data, say locale? Where, in my last URI, do I "RESTfully" add this meta-data? After the mime extension?

http://www.my_domain.com/item/v-1.2/123.xml.en_US

in the URI itself?

http://www.my_domain.com/item/v-1.2/en_US/123.xml

This version seems a little weird, so let's skip that one and just consider locale at the end of the URI.

If we choose a convention for the location of these points of meta-data and stick with it, we could be OK. But then what happens when a client doesn't care, assumes a default response type, and drops the mime extension? The URI could become

http://www.my_domain.com/item/v-1.2/123.en_US

So now the service needs to figure out whether "en_US" is a document/mime type or a locale definition. Yeah, you could create a bad-ass regex engine to derive all the possible variants, but would you really want to?

I think I could go on and on in this vein, adding more meta-data points (although I don't really have any more at this time), but suffice it to say that if your needs are simple (you don't care about versioning or locale or any of that other accoutrement), you could probably go through life quite happy with our original URI

http://www.my_domain.com/item/123

But when you need more expressiveness, I'm just not sure that making the URI bend over backwards (huge caveat) without creating your own custom URI rules/syntax is going to get the job done in a reasonable, RESTful manner. I'm just not sure...

Tuesday, November 20, 2007

Test driving Scala

At the QCon conference in San Francisco two weeks ago, I heard Brian Goetz speak about concurrency, its past and where we're at now. It was a really great session (I should probably blog about it in the future), but one of the problems he identified with concurrency in Java is the notion of shared state between threads. Long story short, some alternate approaches to concurrency and safety include the techniques that functional languages like Erlang and Scala are taking.

In the interest of tooling around late at night after my wife and son have fallen asleep, I've been poking around with Scala. I'm just getting started really, but one of the nice things included in the basic download kit, available from the scala-lang site, is an interactive command-line console. It works very much like Ruby's irb: type in a command, and watch it instantly get pissed off about your syntax! Very handy for getting your feet wet - in between late-night diaper changes.
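A thirty-second taste:

$ scala
scala> 1 + 2
res0: Int = 3
scala> val greeting = "hola"
greeting: java.lang.String = hola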

More info later. Will probably play around some more on my trip to India next week, as I'll be there for two weeks.

** UPDATE: Just in case you are crazy enough to develop on OpenSolaris (on an x86 laptop, no less), you will need to hack the shell scripts that launch the various command line utilities. Basically, if you aren't doing anything nuts, you can just yank the setting of the JAVA_OPTS env var where the script invokes the JVM (typically on the last line of the script). I just kept the min/max JVM heap sizes as is (-Xmx, -Xms). Cheap and easy, but it works!

There is an open ticket on the scala site for this as a known issue, but this hack will stop bash from complaining "bad interpreter" every time you try to launch a scala command.

Monday, November 12, 2007

Backwards compatibility in services

With all the hoopla over web services (including the WS-* and REST varieties) and asynchronous messaging, there seems to be little discussion of how to make these services (I'm using the term very generically here) compatible with existing service consumers. This issue is really not a problem for vendors or tools providers to solve, as it pertains to the problem domain at hand and how the engineers and architects want to manage it. In other words, it's yer own damn problem.

After speaking with engineers and architects about how they approach service evolution, there seem to be some approaches that are common:
  • the no-harm, no-foul approach
  • the big-bang approach
  • the multiple service instances approach
  • the maintain-it-yourself approach
First up, the no-harm, no-foul approach (a/k/a close-your-eyes-and-hope-everything-is-cool) updates the service and expects the clients to "just work". Funny, life just doesn't work out that nicely. What frequently happens is that after the new service code rolls into qa (or, god forbid, production), clients start breaking - and then it's either revert the service or update the rogue clients.

Second, the big-bang approach updates the service and the clients at the same time. While certainly handy, you need to be in control of both the service and all of the clients, and have qa and systems teams who can accommodate this approach. If all the clients are internal, this might not necessarily be a bad approach, but that point notwithstanding, I've found that either not all the clients get properly updated (whether from a code migration problem or from forgetting to update some client's code) or, more likely, not every client gets tested thoroughly.

The multiple service instances approach is perhaps the most common deployment I've seen. When you need to release a new version of the service, you just deploy a new web-app with the updated code and a new URL (typically, a version number will appear in the URL itself). On the surface, this works rather well, as old services are not removed (hence their clients keep working as they should), but I feel there are two major drawbacks. The first is the database. Assuming the service uses a relational database (as most do), database schemas are not backwards compatible (unless you really work at it - triggers, synonyms, and so on; check out Scott Ambler and Pramod Sadalage's Refactoring Databases: Evolutionary Database Design - it's truly brilliant). The second problem is resource consumption - memory, disk space, power, etc. While it appears most obviously in small shops, many medium to large enterprises can ignore these problems. However, large industry players like Dan Pritchett, architect at eBay, are now intimately concerned with these issues as they build out more and more redundant data centers.

Lastly, the maintain-it-yourself approach deploys just a single version of the service (the most recent one), which is able to successfully process and reply to clients at all version levels. Typically, a service version is indicated somewhere by the client (in the URL, an XML namespace, and so on); the service then uses that piece of information to aggregate the incoming parameters as needed in a version-specific manner, perform its normal functions at the current operating version level, and finally return the result to the client in the version they are expecting. This approach clearly has an impact on the engineers who need to create new functionality yet maintain backwards compatibility, and that is its biggest cost. It could be argued that the processing cycles required to determine the requested version, convert up, process, then convert down could be a drag on performance; I tend to think that with everything else going on in a service, this cost should not be a significant factor - or something else squirrelly may be going on. This approach does, however, allow the database to evolve at a natural pace, and that cannot be overstated (at least, while we use databases in the manner in which we traditionally have).
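In code, the shape of this approach is roughly the following (a sketch; every type and method name here is illustrative):

// all requests funnel through one pipeline that runs the latest version internally
public ResponseDoc handle(RequestDoc request) {
    String version = request.getVersion();             // e.g. pulled from the URL or XML namespace

    RequestDoc current = upconvert(request, version);  // massage old shapes into the latest one
    ResponseDoc result = process(current);             // business logic knows only one version
    return downconvert(result, version);               // reply in the shape the client expects
}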

As a full disclaimer, I have, consciously or unconsciously, used all four of these approaches at one time or another - and in the order of discovery as laid out above. After the initial birth of the body of services, I naturally needed to keep adding functionality. After making a whole lot of mistakes in the process of maturing the services (I screwed up the production servers/services/web-apps and databases several times), the best solution for my system was to maintain backwards compatibility for all clients. Granted, we have quite a good view of what/who all the clients of the services are, so we can be a bit more proactive in phasing out old client implementations.