JBootCat
Overview
JBootCat is a Java implementation of the BootCat scripts written by Marco Baroni et al. for generating corpora from the Internet. The main goal is to encapsulate the BootCat functionality within a user-friendly desktop application. Because it is built on the Java platform, JBootCat runs easily on most major operating systems.
JBootCat is free and open source. It is released under the LGPL licence.
News
Features
JBootCat does not yet implement the full BootCat functionality, and the user interface is a little rough around the edges. However, v0.2 can already download a corpus from the Internet via Google.
Features include:
- Step-by-step "wizard" interface - review each step of the process
- Enter "seeds" directly or load them from a file (and save them to a file for future use).
- Generate "tuples" directly or load them from a file (and save them to a file for future use).
- Queries Google's massive online index to obtain relevant web pages (only HTML pages supported at the moment).
- HTML cleanser and advanced tokeniser (courtesy of jTokeniser).
- URL review
- Selected URLs are downloaded to a text file (using BootCat's "Raw" format) and saved as UTF-8.
- Multi-platform - runs on any computer with Java installed.
- Free and Open Source (LGPL)
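To make the "seeds" and "tuples" steps above more concrete, here is a minimal sketch of how tuples might be generated from a seed list. It assumes the usual BootCat idea that each tuple is a random combination of a few distinct seeds, later used as one search query. The class `TupleSketch`, the method `generateTuples` and its parameters are hypothetical names chosen for illustration; this is not JBootCat's actual code, and it needs Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical helper for illustration only; not part of JBootCat.
class TupleSketch {

    /** Number of ways to choose k items from n, stopping early once it exceeds the int range. */
    static long combinations(int n, int k) {
        long result = 1;
        for (int i = 1; i <= k; i++) {
            // Exact at every step: the running value is C(n - k + i, i).
            result = result * (n - k + i) / i;
            if (result > Integer.MAX_VALUE) {
                return result; // already larger than any int count we could be asked for
            }
        }
        return result;
    }

    /**
     * Draws `count` distinct tuples, each made of `tupleSize` distinct seeds.
     * Assumption: a tuple is an unordered combination, so {a, b} equals {b, a}.
     */
    static List<List<String>> generateTuples(List<String> seeds, int tupleSize,
                                             int count, Random rng) {
        // Drop duplicate seeds while keeping their original order.
        List<String> unique = new ArrayList<>(new LinkedHashSet<>(seeds));
        if (tupleSize < 1 || tupleSize > unique.size()) {
            throw new IllegalArgumentException("tupleSize must be between 1 and the number of seeds");
        }
        if (count < 0 || count > combinations(unique.size(), tupleSize)) {
            throw new IllegalArgumentException("not enough seeds for that many distinct tuples");
        }

        Set<List<String>> tuples = new LinkedHashSet<>();
        while (tuples.size() < count) {
            List<String> shuffled = new ArrayList<>(unique);
            Collections.shuffle(shuffled, rng);
            List<String> tuple = new ArrayList<>(shuffled.subList(0, tupleSize));
            Collections.sort(tuple); // canonical order so equal combinations compare equal
            tuples.add(tuple);       // the set silently ignores repeats
        }
        return new ArrayList<>(tuples);
    }

    public static void main(String[] args) {
        List<String> seeds = Arrays.asList("corpus", "linguistics", "tokeniser", "concordance");
        // Three 2-seed tuples; the fixed random seed just makes runs repeatable.
        for (List<String> tuple : generateTuples(seeds, 2, 3, new Random(42))) {
            System.out.println(String.join(" ", tuple)); // one query string per line
        }
    }
}
```

Each printed line is one tuple written as a space-separated query string. Sorting the seeds inside a tuple is a design choice of this sketch so that duplicates are easy to detect; a real implementation may prefer to keep the random order.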
Screenshots
Download
Please select the version you wish to use:
| Version | Download | README | Source |
| --- | --- | --- | --- |
| 0.2 | jbootcat-0.2 | README | jbootcat-0.2_src.zip [source code] |
More details can be seen in the changelog.
NOTE: Before you can use JBootCat on your system, you must ensure that the Java Runtime Environment (version 1.5/5.0 or greater) is installed on your computer. It can be downloaded from http://java.sun.com.
Documentation
There's no documentation at the moment. Please be patient. Hopefully JBootCat is so simple none is required!
Acknowledgements
Firstly, my sincerest thanks to Marco Baroni for granting me permission to use the "BootCat" name, not to mention the great work he and his colleagues did in developing the BootCat methodology in the first place. Please visit the BootCat page to learn more about it.
From the programming perspective, my progress would have been severely hindered if it weren't for the following Free and Open Source projects:
- NetBeans IDE - a great up-and-coming Java IDE.
- JTidy - Andy Quick's Java port of HTML Tidy which is great for extracting the content from HTML files.
- Wizard framework - a nifty spin-off from NetBeans that enables easy creation of step-by-step "wizard" interfaces.
- Launch4j - a nice utility for creating Windows executables to launch Java applications. (Keeps the Windows users happier!)
Let's also not forget a key component, which is unfortunately not open source:
- Google API - Google's official library that allows JBootCat to query Google.
