Andrew Roberts @ School of Computing

JBootCat

Overview

JBootCat is a Java implementation of the BootCat scripts written by Marco Baroni et al. for generating corpora from the Internet. The main goal is to encapsulate the BootCat functionality within a user-friendly desktop application. The advantage of using the Java platform is that JBootCat can be run easily on most major operating systems.

JBootCat is free and open source. It is released under the LGPL licence.

News

14th July 2006 - JBootCat v0.2 released
I'm pleased to announce the first public release of JBootCat. As you may appreciate from the version number, there is still a great deal of work to be done before JBootCat is fully functional. However, I have implemented enough to allow the user to download a corpus from a set of seeds using the BootCat approach.
See the screenshots below.

Features

JBootCat is still some way off implementing the full BootCat functionality, and the user interface is a little rough around the edges. However, v0.2 is already capable of downloading a corpus from the Internet via Google.

Features include:

  • Step-by-step "wizard" interface - review each step of the process
  • Enter "seeds" directly or load them from a file (and save them to a file for future use).
  • Generate "tuples" directly or load them from a file (and save them to a file for future use).
  • Queries Google's massive online index to obtain relevant web pages (only HTML pages supported at the moment).
  • HTML cleanser and advanced tokeniser (courtesy of jTokeniser).
  • URL review
  • Selected URLs downloaded to a text file (using BootCat's "Raw" format) and saved as UTF-8.
  • Multi-platform - runs on any computer with Java installed.
  • Free and Open Source (LGPL)
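
To give a feel for the "seeds" and "tuples" steps listed above, here is a minimal sketch of how seed terms might be combined at random into tuples, each of which can then be sent to a search engine as one query. This is not JBootCat's actual code: the class name TupleSketch, the method generateTuples and its parameters are all hypothetical, and I'm assuming the usual BootCat idea that a tuple is simply a small random combination of distinct seeds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration only; not taken from the JBootCat source.
public class TupleSketch {

    /**
     * Builds up to `count` distinct tuples, each holding `tupleSize` distinct seeds.
     * Fewer tuples are returned if the seeds do not allow `count` distinct combinations.
     */
    public static List<List<String>> generateTuples(List<String> seeds, int tupleSize,
                                                    int count, Random rng) {
        // Drop repeated seeds so that a tuple never contains the same term twice.
        List<String> unique = new ArrayList<>(new LinkedHashSet<>(seeds));
        if (tupleSize < 1 || tupleSize > unique.size()) {
            throw new IllegalArgumentException(
                    "tupleSize must be between 1 and the number of distinct seeds");
        }
        Set<List<String>> tuples = new LinkedHashSet<>();
        int attempts = 0;
        int maxAttempts = count * 50; // stop eventually if few distinct tuples exist
        while (tuples.size() < count && attempts++ < maxAttempts) {
            List<String> shuffled = new ArrayList<>(unique);
            Collections.shuffle(shuffled, rng);
            List<String> tuple = new ArrayList<>(shuffled.subList(0, tupleSize));
            Collections.sort(tuple); // canonical order, so {a, b} and {b, a} count once
            tuples.add(tuple);
        }
        return new ArrayList<>(tuples);
    }

    public static void main(String[] args) {
        // Example seeds are made up for this sketch.
        List<String> seeds = Arrays.asList(
                "corpus", "linguistics", "annotation", "tokeniser", "parsing");
        for (List<String> tuple : generateTuples(seeds, 3, 5, new Random())) {
            System.out.println(String.join(" ", tuple)); // one query string per line
        }
    }
}
```

Each printed line is one candidate query. Sorting the terms inside a tuple is just a convenient way to spot duplicates in this sketch; a real implementation may well handle that differently.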

Screenshots

Download


Please select the version you wish to use:

0.2 jbootcat-0.2 README jbootcat-0.2_src.zip [source code]

More details can be seen in the changelog.

NOTE: Before you can use JBootCat on your system, you must ensure that the Java Runtime Environment (version 1.5/5.0 or greater) is also installed on your computer. It can be downloaded from http://java.sun.com and installed easily.

Documentation

There's no documentation at the moment. Please be patient. Hopefully JBootCat is so simple none is required!

Acknowledgements

Firstly, my sincerest thanks to Marco Baroni for granting me permission to use the "BootCat" name, not to mention his great work (with his colleagues) in developing the BootCat methodology in the first place. Please visit the BootCat page to learn more about it.

From the programming perspective, my progress would have been severely hindered if it weren't for the following Free and Open Source projects:

  • NetBeans IDE - a great up-and-coming Java IDE.
  • JTidy - Andy Quick's Java port of HTML Tidy, which is great for extracting the content from HTML files.
  • Wizard framework - a nifty spin-off from NetBeans that enables easy creation of step-by-step "wizard" interfaces.
  • Launch4j - a nice utility for creating Windows executables to launch Java applications. (Keeps the Windows users happier!)

Let's also not forget a key component, which is unfortunately not open source:

  • Google API - Google's official library that allows JBootCat to query Google.
