Bit Paragon: Splitting a Japanese String, the old Ruby way

If you are lucky, you are already using Ruby 1.9, and this post does not apply to you.

But if you are stuck on an older version of Ruby (1.8.6 or 1.8.7) and have tried to process a Japanese string (or any string in a multibyte encoding), then you know what I mean.

For example:

text = "日本語能力試験"
text.slice(0,3) # Returns the first 3 bytes

will give you the first 3 bytes, not the first 3 characters as you would expect.

In versions of Ruby before 1.9, most string methods operate at the byte level.
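One way to see the gap between bytes and characters is String#unpack: the "C*" directive yields one integer per byte, while "U*" decodes the string as UTF-8 and yields one integer per character. A quick sketch, assuming the source file is saved as UTF-8:

```ruby
# Assumption: this file is saved as UTF-8, so each kanji below takes 3 bytes.
text = "日本語能力試験"

byte_count = text.unpack("C*").length  # one array entry per byte
char_count = text.unpack("U*").length  # one array entry per UTF-8 character

puts byte_count  # 21
puts char_count  # 7
```

Seven characters take up 21 bytes here, which is why a byte-based slice(0,3) cuts the string after a single kanji.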

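A horrible workaround I found very useful is to match the text against a regular expression that selects the part you want to process. Note that "." only matches a whole character when the regex is in UTF-8 mode, either through the u flag as used here or through $KCODE = 'u'; otherwise it matches a single byte.

text = "日本語能力試験"
text.scan(/.{3}/u)[0] # Returns the first 3 characters (assumes UTF-8 text)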
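If you need something other than the first few characters, the same trick can be wrapped in a small helper. The sketch below is only an illustration: utf8_slice is a made-up name, not a Ruby method, and it assumes the input is valid UTF-8. It splits the string into characters with a UTF-8 regex and then slices the resulting array.

```ruby
# Hypothetical helper (not part of Ruby): character-based slice for UTF-8 text.
# Assumption: `text` is valid UTF-8 and this file is saved as UTF-8.
def utf8_slice(text, start, length)
  # /u puts the regex in UTF-8 mode so "." matches one whole character;
  # /m lets "." match newlines as well, so no character is dropped.
  chars = text.scan(/./mu)
  # Array#[] returns nil when start is past the end, so fall back to [].
  (chars[start, length] || []).join
end

text = "日本語能力試験"
puts utf8_slice(text, 0, 3)  # 日本語
puts utf8_slice(text, 3, 2)  # 能力
```

Scanning the whole string just to take a few characters is wasteful for long texts, but for short strings scraped from a page it is usually good enough.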

This is very useful for parsing web pages (especially old ones!), which are often a terrible mix of encodings.

Bit Paragon

Dec 11, 2010
