Dec 11, 2010

Splitting a Japanese String, the old Ruby way

If you are lucky, you are already using Ruby 1.9, and this post does not apply to you.

But if you are stuck on an older version of Ruby (1.8.6 or 1.8.7) and have tried to process a Japanese string (or any string in a multibyte encoding), then you know what I mean.

For example
text = "日本語能力試験"
text.slice(0,3) # Returns the first 3 bytes
will give you the first 3 bytes, not the first 3 characters as you would expect.

In Ruby versions before 1.9, most string methods operate at the byte level.
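
To make the byte/character difference concrete, here is a small sketch that counts both. It assumes the string is UTF-8 encoded, and it only relies on String#unpack and Array#pack, which exist in 1.8 as well as in newer versions.

```ruby
# -*- coding: utf-8 -*-
# Assumption: the source file and the string are UTF-8.
text = "日本語能力試験"

byte_count = text.unpack('C*').size # 'C' reads one unsigned byte at a time
char_count = text.unpack('U*').size # 'U' reads one UTF-8 character at a time

# The first 3 bytes are exactly the bytes of the first character.
first_three_bytes = text.unpack('C*')[0, 3].pack('C*')

puts "bytes: #{byte_count}, characters: #{char_count}"
```

Each of these kanji takes 3 bytes in UTF-8, so the 7 characters add up to 21 bytes, and a byte-level slice of length 3 covers only the first character.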

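A horrible workaround I found very useful is to match the text against a regular expression that selects the part of the text you want to process. In Ruby 1.8 the regexp only matches whole characters if it knows the encoding, so either add the /u flag (assuming the text is UTF-8) or set $KCODE accordingly.

text = "日本語能力試験"
text.scan(/.{3}/u)[0] # Returns the first 3 characters (UTF-8 assumed)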

This is very useful for parsing web pages (especially the old ones!), which are often a terrible mix of encodings.
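
If you would rather avoid regular expressions, the same idea can be sketched with unpack/pack. This is only a sketch: utf8_slice is a name I made up for illustration, and it assumes the text is valid UTF-8.

```ruby
# -*- coding: utf-8 -*-
# utf8_slice is a hypothetical helper name, not part of Ruby.
# Assumption: text is valid UTF-8.
def utf8_slice(text, start, length)
  codepoints = text.unpack('U*')               # one integer per character
  (codepoints[start, length] || []).pack('U*') # back to a UTF-8 string
end

text = "日本語能力試験"
puts utf8_slice(text, 0, 3) # 日本語
puts utf8_slice(text, 3, 2) # 能力
```

Working on an array of code points means start and length are counted in characters, not bytes. An out-of-range start gives an empty string.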




Dec 4, 2010

Getting the Number of Pages in PDF Files with Ruby

Just a brief snippet showing how to get the number of pages in a PDF file.
It works most of the time for conventional PDF files (not encrypted and not password-protected).

# 'rb' (binary mode) is required on Windows!
text = File.open('myfile.pdf', 'rb') { |file| file.read }

# Count the "Count N" entries and the individual "/Type /Page" objects
keyword_c = text.scan(/Count\s+(\d+)/).size
keyword_t = text.scan(/\/Type\s*\/Page[^s]/).size

# Use whichever count is larger
pages = [keyword_c, keyword_t].max

puts "Total pages: #{pages}"
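
A possible variation, offered only as a sketch: instead of counting how many "Count" entries appear, read the numbers they carry and take the largest one, since the root of the page tree normally stores the total page count in /Count. The method name estimate_pdf_pages is mine, and the heuristic still fails on PDFs that compress their objects or use /Count for large outlines.

```ruby
# estimate_pdf_pages is a hypothetical helper name.
# Assumption: pdf_bytes was read in binary mode and the page objects
# are stored uncompressed, so they are visible as plain text.
def estimate_pdf_pages(pdf_bytes)
  # Numeric values of every "/Count N" entry (page tree nodes, outlines)
  counts = pdf_bytes.scan(/\/Count\s+(\d+)/).flatten.map { |n| n.to_i }
  # Number of individual page objects ("/Type /Page", but not "/Pages")
  page_objects = pdf_bytes.scan(/\/Type\s*\/Page[^s]/).size
  ([page_objects] + counts).max
end

# Usage:
# data = File.open('myfile.pdf', 'rb') { |f| f.read }
# puts "Total pages: #{estimate_pdf_pages(data)}"
```

Taking the maximum of both figures keeps the fallback from the original snippet: if no /Count entry is found, the number of page objects is used.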





Dec 3, 2010

Clouding your Git Repo in Amazon S3 under Windows

Recently I tried to set up a Git repository in Amazon S3.
I found it pretty straightforward to do on Linux and Mac using JGit.
But I also needed to access my repo from a Windows box.
I googled a little, but most of the solutions relied on JungleDisk or Cloudberry.
I found an interesting link about JGit on Windows,
"Using Git on Windows without any of the Cygwin/msysgit nonsense",
and I decided to try it in combination with Git and S3.

I tried just once, and it worked!

Here is the procedure I followed:

# Download and install "msysgit"
(currently Git-1.7.3.1-preview20101002.exe):
http://code.google.com/p/msysgit/

# Download and install "jgit.sh"
(currently org.eclipse.jgit.pgm-0.9.3.sh):
http://eclipse.org/jgit/download/

# Download and install Java (probably already installed):
http://www.java.com/en/download/index.jsp


# Rename "org.eclipse.jgit.pgm-0.9.3.sh" to "jgit" and put it in a directory "JGit"
# Create a batch file "jgit.bat" in the same directory with the following content
# (%~dp0 expands to the batch file's own directory, %* forwards all arguments):
java -cp "%~dp0jgit" org.eclipse.jgit.pgm.Main %*

# Add the directory containing the batch file to the PATH environment variable so it can be found
from the command line (System Properties -> Environment Variables).

# From the msysgit console, test "jgit":
$ jgit version
jgit version 0.9.3

# Create a local git repository.
# In the Git console:

$ mkdir project
$ cd project
$ echo "initial test" > README
$ git init
$ git add .
$ git commit -m "initial commit"
$ git status

# Make your Amazon S3 keys accessible to "jgit".
# Create the file ~/.jgit and put your keys in it:

$ touch ~/.jgit
$ notepad ~/.jgit
accesskey: xxx
secretkey: yyy

# Add the Amazon S3 bucket "yourbucket" as a remote
(create the bucket beforehand with a tool such as S3Fox in Firefox):
$ git remote add s3 amazon-s3://.jgit@yourbucket/project.git

# Push your local repository to the bucket
$ jgit push s3 master

# To clone the repository:
$ jgit clone amazon-s3://.jgit@yourbucket/project.git

# To check the configured remotes:
$ git remote -v

# If you want to keep the remote name consistent (the clone calls it "origin"):
$ git remote rename origin s3

# To commit your changes as usual:
$ git commit -a -m "updated repo"

# To push your changes to your S3 bucket:
$ jgit push s3 master

# Because "jgit" doesn't have a pull command, you have to split it into two steps:
$ jgit fetch s3
$ git merge s3/master
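
If typing the two steps gets tedious, they can be wrapped in a few lines of Ruby. This is only a sketch: jgit_pull is a name I made up, it assumes both "jgit" and "git" are on the PATH, and the runner parameter exists just so the command execution can be swapped out.

```ruby
# jgit_pull is a hypothetical helper, not a jgit or git command.
# Assumptions: "jgit" and "git" are on the PATH, and the remote and
# branch names contain no spaces or shell metacharacters.
def jgit_pull(remote = 's3', branch = 'master', runner = nil)
  runner ||= lambda { |command| system(command) }
  steps = [
    "jgit fetch #{remote}",
    "git merge #{remote}/#{branch}"
  ]
  steps.each do |command|
    # Stop at the first failing step so a failed fetch is never merged.
    return false unless runner.call(command)
  end
  true
end

# Usage (runs the real commands):
# jgit_pull('s3', 'master')
```

The commands are passed to system as single strings so that they go through the shell, which is what lets Windows resolve "jgit" to the batch file.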

That's it. You can now access your Git repo in Amazon S3 from a Windows box.

For Mac OS X and Linux you can check these blog posts, from which I took some of the steps above:
http://gabrito.com/post/storing-git-repositories-in-amazon-s3-for-high-availability
http://blog.spearce.org/2008/07/using-jgit-to-publish-on-amazon-s3.html




Dec 2, 2010

"hello, world"

#include <stdio.h>

int main(void) {
        printf("hello, world\n");
        return 0;
}