Buckwalter2unicode

Download

binary: buckwalter2unicode-2.0.zip
source: buckwalter2unicode @ GitHub

Description

This buckwalter2unicode script is designed to convert Arabic text that has been transliterated to ASCII symbols using the Buckwalter Transliterion method. (See http://www.qamus.org/transliteration.htm for full details of his approach). The output is Arabic text in the Unicode encoding.

There is also the option to perform the transliteration is reverse, that is, taking a Unicode file containing Arabic, and converting to ASCII symbols.

Setting up

buckwalter2unicode is programmed in Python. Python is an interpreted language, and therefore you will need the Python software installed on your system before you can run this script. This program was written and tested on Python 2.2.2 only. I am pretty sure that due to Unicode issues, it will not run on any version less than 2.2. At the time of writing Python 2.3.4 is available.

Python is available for most major platform, including Microsft Windows, Linux (and other UNIX platforms), Mac OS X, etc. Look to the official Python documentation for details on how to download and install Python on your system.

Other than Python, there are no other dependencies.

How to use

Open a command-prompt. Move to the directory where the script has been saved. Then type:

 python buckwalter2unicode.py -h

This will invoke an instance of Python, which will then execute the code in the buckwalter2unicode.py file. The `-h' argument tells the script to output the possible options for using this program. It will look like this:

Usage: buckwalter2unicode.py -i INFILE -o OUTFILE [-g CHARS -r -e INPUT_ENCODING, -E OUTPUT ENCODING]
       buckwalter2unicode.py -l
       buckwalter2unicode.py -h

  -i, --input:
    Path to text file to be transliterated to Unicode.

  -o, --output:
    Path of file to output the newly transliterated text.

  -e, --input-encoding:
    Specify the text encoding of the source file. Default: latin_1.

  -E, --output-encoding:
    Specify the text encoding of the target file. Default: utf_8.

  -g CHARS, --ignore-lines=CHARS:
    Will not transliterate lines that start with any of the CHARS
    given. E.g., -g #; will not alter lines starting with # or ;.
    (May need to be -g \#\; on some platforms. See README.txt.

  -c RANGE, --columns=RANGE:
    If in columns, select columns to apply transliteration. Can be
    comma separated numbers, or a range. E.g., -c 1, -c 1-3, -c 1,3.

  -d CHAR, --delimiter=CHAR:
    Specify the delimiter that defines the column if using the -c
    option above. Default is ' ' (space).

  -r, --reverse:
    Reverses the transliteration, i.e., Arabic to Buckwalter.
    When used, it will change the default input encoding to utf_8 and
    output encoding to latin_1

  -l, --list-encodings:
    Displays all supported file encodings.

  -h, --help:
    Displays this page.

Examples

You have a file called arabic.txt that contains Arabic that has been written as ASCII symbols under the Buckwalter system. Using:

 $ python buckwalter2unicode.py -i arabic.txt -o arabic_utf8.txt

Will convert the input to proper Arabic using the Unicode UTF-8 encoding.

If you know actual encoding of the input file, then you can specify using the `-e' flag. E.g., if arabic.txt was written on a European version of Windows, then it is likely to be encoding with Windows CodePage 1252. To use this information:

 $ python buckwalter2unicode.py -i arabic.txt -e windows-1252 -o arabic_utf8.txt

If you would like to use a different Unicode standard, such as UTF-16 for the output, then this can be specified with the `-E' flag as follows:

 $ python buckwalter2unicode.py -i arabic.txt -e windows-1252 -o arabic_utf8.txt -E utf_16

Specifying encodings is optional. The default encodings are latin_1 (also known as ISO-8859-1) for input, and UTF-8 for output.

If you want to perform Unicode -> Buckwalter transliteration, you must supply the `-r' flag. You still need to specify the input and output filenames, and any encoding information if appropriate. Note, because of the reverse direction, then the encoding defaults are also swapped. The input file is assumed to be a Unicode file, therefore UTF-8 is the default. For output, as it's converting to ASCII, the latin_1 encoding will be used.

The `-l' option can be used that will simply print to screen all the available encodings that Python currently supports. The list is quite long. This list will show the `codec' name for each encoding. This is basically the name Python uses to refer to encoding names. There is then a column of aliases, as often there are slight permutations of the name, e.g., `utf_16' is a codec, and `utf16' and `U16' are aliases. A final column gives a brief description of the language(s) supported by that particular encoding.

Ignoring certain lines

The -g option allows you to specify a string of letters/symbols. If any of the lines in the input file begin with any of the characters given, then no transliteration will be applied to that line, and it will be output verbatim. This is useful for files that lines that contain comments about subsequent data, as the transliteration would most likely turn them into meaningless strings. Often these lines are annotated with a symbol to indicate that they are comments. E.g.,

 -g #$

This translates to lines beginning with either # or $ will not be transliterated. It does NOT mean lines beginning with #$. Unfortunately, multi-character conditions are not supported. (Yet!)

NOTE: Depending on your platform and command-prompt shell, certain symbols have special meanings to that shell. For example, I use the Linux platform with the Bash shell. The # symbol has a special meaning with this shell. In this event you will have to `escape' the symbol. This is done by prefixing your chosen symbol with a backslash. If I want to get the above example working correctly, I need to use the following:

 -g \#$

Typically, you are aware when you need to use escaped symbols. If when you enter the command with literal symbols and they have special significance, the program will abort and you will see the usage information of the program.

I have not been able to test this on all platforms, however, for my setup with Bash on Linux, I know that the following set of symbols need to be escaped before they will work correctly:

`
#
"
&
(
)
;
'
|
<
>
\

If you need to use any of the above symbols, then simply prefix with a backslash (`\').

Transliterating certain columns only

Imagine you have a file that looks as follows:

ktb katab write
ktb kitAb book
mktb makotab library
...

The format is:

unvowelled_word vowelled_word english_translation

Three columns, each separated (delimited) by a space.

If you were to pass this through the transliteration system, then by default, it would convert the entire line, including the English words in the final column. The output would look fine for the first two columns, but you can imagine that the translation will look unrecognisible.

It is possible to specify which columns to transliterate if you wish to be specific. (There is of course the requirement that there must be something that will consistently allow the code to identify the column boundaries. In this instance, all columns are delimited by a space.)

If you add the -c RANGE option at the command line, you can tell the program which columns are to be transliterated. The RANGE can be described as a comma separated sequence of column numbers (`1,2'), a column range (`1-2') or a combination of both (`1,2,3-5,9'). Note, 1 equals the first column and so on.

Therefore, for the above example, we'd want:

 -c 1-2

Setting the delimeter

In the above previous example, the script seemed to `know' where the column boundaries were. This was actually coincidence: the default delimiter is a space character (' '), which just happened to be the same as the delimiter of the example. Take the similar example:

ktb;katab;write
ktb;kitAb;book
mktb;makotab;library

The only difference here is that the delimiter is a semi-colon (;) rather than a space. However, the program's default is a space. To override this default, use the -d option as follows:

 -c 1-2 -d \;

As mentioned earlier, using the semi-colon at the command-line may require you to escape using the backslash.

This option only accepts single character delimiters. If you were to input more than one character, the script will automatically strip away additional ones.

Note: do not try to use a symbol that also appears in the Buckwalter tranliteration character set. It will obviously cause huge problems. The program will abort if you try this.

FYI: The space character is actually not the best choice of delimiters for textual data, as sometimes spaces may be part of a phrase that is not meant to be confused with a column separator. For example, take the following similar example:

kAtb kAtab correspond with ktb
ktAb kitAb piece of writing ktb

The format is slightly different, in that there is an additional column that contains the root of the preceeding Arabic words. However, you will notice that the translations happen to also contain spaces. This now presents a problem, in that for the first line, columns 1,2,5 are the ones to be transliterated, yet for the second line it is 1,2,6! It is no longer possible to specify a consistent range. If an alternative delimiter was used within the source file itself, it would be easy, e.g.,

kAtb	kAtab	correspond with	ktb
ktAb	kitAb	piece of writing	ktb

The above example now uses tabs. Tabs are typically represented as \t. But to use on the command-line you will need a double backslash:

 -d \\t

A note about encodings

Some people may be slightly bewildered about the regular mentions of `encodings' in this file. For those who are unsure of what I mean, I recently saw it well explained by Joel Spolsky on his website. Joel is a software engineer, and despite the title of his article being "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", I believe that you don't have to be a programmer to understand it. You can find this article at: http://www.joelonsoftware.com/articles/Unicode.html

License information

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA