Software
There's not much here yet, but I will eventually start putting some of my little apps here.
aConCorde
aConCorde is a multi-lingual concordance tool.
Originally developed for native Arabic concordance,
it possesses basic concordance functionality, as well
as English and Arabic interfaces. It is written in Java,
so it will run on any platform that has the Java
Runtime Environment installed. Go to the aConCorde project
homepage for full details.
jTokeniser
jTokeniser is a set of classes that provide a variety of tokenisers
for your Java projects. Simple tokenisers such as WhiteSpaceTokeniser or StringTokeniser
provide basic token extraction, whereas RegexTokeniser and BreakIteratorTokeniser offer more
advanced possibilities for more thorough tokenisation, such as discarding punctuation. Recent
additions include RegexSeparatorTokeniser, which allows complex definitions of token delimiters.
A SentenceTokeniser is also provided for segmenting text into sentences.
Go to the jTokeniser project
homepage for full details.
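As a rough illustration of how these tokeniser styles differ, here is a minimal sketch. It is written in Python rather than Java and does not use jTokeniser's API: the function names, the `\w+` pattern and the naive sentence rule are all assumptions made for the example.

```python
import re


def whitespace_tokenise(text):
    """Split on runs of whitespace; punctuation stays attached to the words."""
    return text.split()


def regex_tokenise(text, pattern=r"\w+"):
    """Keep only the spans that match the pattern, which drops punctuation."""
    return re.findall(pattern, text)


def sentence_tokenise(text):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


sample = "Hello, world! Tokenisers differ."
print(whitespace_tokenise(sample))  # punctuation kept on the tokens
print(regex_tokenise(sample))       # punctuation discarded
print(sentence_tokenise(sample))    # two sentences
```

The whitespace version is the cheapest but leaves "Hello," and "world!" as tokens, which is usually why a regex-based tokeniser is preferred for word counts. A real sentence segmenter would also need to handle abbreviations and quotes, which this sketch ignores.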
JBootCat
JBootCat is a Java implementation of the BootCat scripts written by Marco Baroni et al. for generating corpora from the Internet. The main goal is to encapsulate the BootCat functionality within a user-friendly desktop application. The advantage of using the Java platform is that JBootCat can be run easily on most major operating systems.
Jacman
Jacman is a GUI frontend to the excellent pacman package manager that comes with the equally excellent Arch Linux.
buckwalter2unicode.py
A fairly simple Python script designed to convert
Arabic text written in Buckwalter's
transliteration system to Unicode. It also
supports the reverse direction, i.e., Unicode to
Buckwalter. Requires Python. Released under the GPL.
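The idea behind the conversion can be sketched as a character-for-character table lookup. This is not the actual buckwalter2unicode.py code. The table below is deliberately partial, covering only the plain consonants and long vowels and leaving out the hamza forms and diacritics. The function names are hypothetical.

```python
# Partial Buckwalter -> Unicode table (illustrative subset, not the full scheme).
BUCKWALTER_TO_UNICODE = {
    "A": "\u0627",  # alef
    "b": "\u0628",  # beh
    "t": "\u062A",  # teh
    "v": "\u062B",  # theh
    "j": "\u062C",  # jeem
    "H": "\u062D",  # hah
    "x": "\u062E",  # khah
    "d": "\u062F",  # dal
    "r": "\u0631",  # reh
    "z": "\u0632",  # zain
    "s": "\u0633",  # seen
    "S": "\u0635",  # sad
    "D": "\u0636",  # dad
    "T": "\u0637",  # tah
    "Z": "\u0638",  # zah
    "E": "\u0639",  # ain
    "g": "\u063A",  # ghain
    "f": "\u0641",  # feh
    "q": "\u0642",  # qaf
    "k": "\u0643",  # kaf
    "l": "\u0644",  # lam
    "m": "\u0645",  # meem
    "n": "\u0646",  # noon
    "h": "\u0647",  # heh
    "w": "\u0648",  # waw
    "y": "\u064A",  # yeh
}

# The reverse direction is just the inverted table.
UNICODE_TO_BUCKWALTER = {uni: bw for bw, uni in BUCKWALTER_TO_UNICODE.items()}


def to_unicode(text):
    """Map each Buckwalter character to Arabic; unknown characters pass through."""
    return "".join(BUCKWALTER_TO_UNICODE.get(ch, ch) for ch in text)


def to_buckwalter(text):
    """Map each Arabic character back to Buckwalter; unknown characters pass through."""
    return "".join(UNICODE_TO_BUCKWALTER.get(ch, ch) for ch in text)


# Round trip: "ktAb" (book) -> Arabic script -> "ktAb"
print(to_buckwalter(to_unicode("ktAb")))
```

Because the scheme is one character to one character, both directions reduce to a dictionary lookup. Characters that are not in the table, such as digits and spaces, are passed through unchanged in this sketch.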
LAPD - Language Analysis for Plagiarism Detection
This is a rather crude implementation in Java,
written by myself and Alex Morrison for a piece of
coursework during our degrees. Its purpose is to
detect cheating between two pieces of natural
language text. It compares trigrams from two source
files and, if there is enough overlap, flags it
as plagiarism. The software also
computes many stylistic statistics, although time ran
out and we didn't factor these into the
detection procedure.
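The trigram comparison can be sketched roughly as follows. This is an illustration in Python, not the LAPD Java code, and it rests on several assumptions: the trigrams are taken over words rather than characters, overlap is measured against the smaller trigram set, and the 0.5 threshold is an arbitrary placeholder.

```python
import re


def word_trigrams(text):
    """Return the set of consecutive three-word sequences in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def trigram_overlap(text_a, text_b):
    """Fraction of the smaller trigram set that also appears in the other text."""
    a, b = word_trigrams(text_a), word_trigrams(text_b)
    if not a or not b:
        return 0.0  # texts shorter than three words have no trigrams
    return len(a & b) / min(len(a), len(b))


def looks_plagiarised(text_a, text_b, threshold=0.5):
    """Flag the pair when the overlap reaches the (assumed) threshold."""
    return trigram_overlap(text_a, text_b) >= threshold


original = "the quick brown fox jumps"
suspect = "the quick brown fox sleeps"
print(trigram_overlap(original, suspect))
print(looks_plagiarised(original, suspect))
```

Dividing by the smaller set means a short passage lifted into a long essay still scores highly. Any real threshold would need tuning on known examples.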
