If you are lucky, you are already using Ruby 1.9, in which case this post is of no use to you.
But if you are stuck on an older version of Ruby (1.8.6 to 1.8.7) and have tried to process a Japanese string (or any string in a multibyte encoding), then you know what I mean.
For example
text = "日本語能力試験"
text.slice(0,3) # Returns the first 3 bytes
will give you the first 3 bytes, not the first 3 characters as you would expect.
In versions of Ruby prior to 1.9, most of the string methods operate at the byte level.
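To make the byte/character gap concrete, here is a minimal sketch that counts both for the same string. It assumes the source file and the string are UTF-8; the variable names are only illustrative. `unpack("C*")` yields one integer per byte, while `unpack("U*")` yields one integer per UTF-8 character.

```ruby
# encoding: utf-8

text = "日本語能力試験"

# One array element per byte.
byte_count = text.unpack("C*").length

# One array element per UTF-8 character (code point).
char_count = text.unpack("U*").length

puts "bytes: #{byte_count}, characters: #{char_count}"
```

Each of these seven characters takes three bytes in UTF-8, which is why a byte-level slice of length 3 covers only the first character.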
A horrible workaround that I have found very useful is to match the text against a regular expression that selects the part of the text you want to process.
text = "日本語能力試験"
text.scan(/.{3}/u)[0] # Returns the first 3 characters (the u flag makes the regexp UTF-8 aware)
This is very useful for parsing web pages (especially the old ones!), which are often a terrible mix of encodings.
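The regular expression trick can be wrapped in a couple of small helpers so the workaround lives in one place. This is only a sketch: `first_chars` and `char_slice` are hypothetical names, not part of Ruby or of the original post, and both assume the text is valid UTF-8. The `u` flag makes the regexp match whole UTF-8 characters, and the `m` flag lets `.` match newlines as well.

```ruby
# encoding: utf-8

# Hypothetical helper: returns up to the first `count` characters.
# \A anchors the match at the start of the string.
def first_chars(text, count)
  text[/\A.{0,#{count}}/mu]
end

# Hypothetical helper: character-aware version of slice(start, length).
# Splits the text into an array of characters, then slices the array.
def char_slice(text, start, length)
  chars = text.scan(/./mu)
  (chars[start, length] || []).join
end

text = "日本語能力試験"
puts first_chars(text, 3)    # the first 3 characters
puts char_slice(text, 3, 2)  # 2 characters starting at index 3
```

Splitting the whole string into characters is wasteful for long texts, so `first_chars` is probably the better choice when only a prefix is needed.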