Welcome

The JTopas project provides a small, easy-to-use Java library for the common problem of parsing arbitrary text data. Such data can come from a simple configuration file with a few comments, an HTML, XML or RTF stream, the source code of various programming languages, etc. Sometimes a text has to be parsed completely, sometimes only parts of it are important.

Some programmers solve parsing problems by extensive use of basic methods like the C library functions strtok and strchr or the Java String class methods indexOf, substring etc., combined with lots of loops, if and switch statements. Others choose basic tokenizers like the Java classes StringTokenizer or StreamTokenizer. For complex syntaxes like document or programming languages, parser generators like lex and yacc, JavaCC or the Java Tree Builder (JTB), as well as existing parsers for a specific language, are a common choice.
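To illustrate the "basic tokenizer" approach mentioned above, here is a minimal sketch that uses only the JDK class StreamTokenizer to read key = value pairs from a small configuration text with # comments. The class name SimpleConfigReader and the configuration format are assumptions made for this illustration; nothing in it is part of JTopas.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of JTopas: reads "key = value" pairs
// with the JDK's StreamTokenizer and skips '#' line comments.
final class SimpleConfigReader {

  static Map<String, String> parse(String text) {
    StreamTokenizer st = new StreamTokenizer(new StringReader(text));
    st.resetSyntax();               // start with an empty syntax table
    st.wordChars('a', 'z');
    st.wordChars('A', 'Z');
    st.wordChars('0', '9');         // digits are kept as part of words
    st.wordChars('.', '.');
    st.wordChars('_', '_');
    st.whitespaceChars(0, ' ');     // control characters and blanks
    st.commentChar('#');            // '#' starts a comment up to end of line

    Map<String, String> result = new LinkedHashMap<>();
    String key = null;
    boolean afterEquals = false;
    try {
      int type;
      while ((type = st.nextToken()) != StreamTokenizer.TT_EOF) {
        if (type == StreamTokenizer.TT_WORD) {
          if (key != null && afterEquals) {
            result.put(key, st.sval);   // word after '=' is the value
            key = null;
            afterEquals = false;
          } else {
            key = st.sval;              // otherwise the word is a new key
          }
        } else if (type == '=' && key != null) {
          afterEquals = true;
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return result;
  }

  public static void main(String[] args) {
    String config = "# server settings\n"
                  + "host = example.org\n"
                  + "port = 8080  # default\n";
    System.out.println(parse(config));  // prints {host=example.org, port=8080}
  }
}
```

Even for this tiny format, the syntax table and the key/value state have to be handled by hand, which is the kind of work that grows quickly with more elaborate inputs.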

With JTopas, a common solution for a wide range of parsing tasks is available. The tedious work of debugging a parser module written from scratch becomes unnecessary or is at least significantly shortened. And for relatively simple problems, the use of parser generators is not necessary.

Features

The JTopas classes and interfaces in their current state of development (version 0.6) can be used for tokenizing and basic parsing tasks. A command line parser, a file reader, an IP protocol interpreter, a partial HTML parser or a tokenizer for JavaCC / JTB may be realized with JTopas. This flexibility is achieved by dynamically configurable classes and a strict separation of different tasks.

These are the main characteristics of the JTopas classes and interfaces:

One or more sorts of line and / or block comments may be added and removed during runtime.
Special sequences like operators and separators with a special meaning can be dynamically added and removed.
Support for keywords is available.
Pattern matching (regular expressions) is available when used with JDK 1.4.
Data may be read from InputStreams as well as from other sources that can be realized by implementing a data source interface.
Parsing characteristics like case-sensitivity, line and column counting and whitespace handling can be set on a global as well as on a per-item basis (where a per-item basis makes sense :-).
Read data may or may not be stored by the tokenizer.
The specific representation of tokens (token images) may or may not be returned by the tokenizer.
Multiple tokenizers may share one data source (embedded tokenizers).
Dynamic configuration is performed via an interface based on a Java Bean design pattern. Parties interested in modifications can attach themselves to the software through a listener interface.
The software can be used purely in a black box manner, with few to many adaptations to the actual needs, or only as a declarative base for completely self-written Java classes.
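The combination of dynamic configuration and a listener interface described in the list above can be pictured with the following self-contained sketch. All names in it (CommentSettings, Listener, commentChanged) are hypothetical and chosen for this illustration only; the actual JTopas interfaces may look quite different.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a bean-style configuration object that notifies
// listeners about changes. None of these names are JTopas API.
final class CommentSettings {

  // Callback for parties interested in configuration changes.
  interface Listener {
    void commentChanged(String start, String oldEnd, String newEnd);
  }

  private final Map<String, String> blockComments = new LinkedHashMap<>();
  private final List<Listener> listeners = new ArrayList<>();

  void addListener(Listener listener) {
    listeners.add(listener);
  }

  // Registers (or replaces) a block comment at runtime.
  void addBlockComment(String start, String end) {
    String oldEnd = blockComments.put(start, end);
    fire(start, oldEnd, end);
  }

  // Removes a block comment at runtime; unknown start sequences are ignored.
  void removeBlockComment(String start) {
    String oldEnd = blockComments.remove(start);
    if (oldEnd != null) {
      fire(start, oldEnd, null);
    }
  }

  boolean isBlockCommentStart(String start) {
    return blockComments.containsKey(start);
  }

  private void fire(String start, String oldEnd, String newEnd) {
    for (Listener listener : listeners) {
      listener.commentChanged(start, oldEnd, newEnd);
    }
  }

  public static void main(String[] args) {
    CommentSettings settings = new CommentSettings();
    settings.addListener((start, oldEnd, newEnd) ->
        System.out.println(start + ": " + oldEnd + " -> " + newEnd));
    settings.addBlockComment("/*", "*/");
    settings.removeBlockComment("/*");
  }
}
```

In such a design, a tokenizer would register itself as a listener so that comments added or removed while it is running take effect immediately.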

While JTopas may be used "out of the box" by simply supplying a data source, setting the tokenizer characteristics and starting to tokenize, the library also provides a rich set of service provider interfaces to implement specific handlers for keywords, patterns etc. Therefore, JTopas can easily be integrated into existing software, contexts other than straightforward parsing, multithreaded environments ...

Example

Here is an example Java program that extracts the contents of an HTML file using JTopas as it comes. It shows that the JTopas classes are independent of the protocol, dialect or language/contents to be parsed. Moreover, what is extracted, and in which way, can be controlled dynamically. With a few alterations, for instance, it would be possible to extract all hyperlinks or the meta information of an HTML source. There are more examples in the JUnit test cases provided with the JTopas library.


  // This will print the contents of an HTML file as
  // roughly formatted text

  // Imports
  import de.susebox.jtopas.Flags;
  import de.susebox.jtopas.Token;
  import de.susebox.jtopas.Tokenizer;
  import de.susebox.jtopas.TokenizerProperties;
  import de.susebox.jtopas.TokenizerSource;
  import de.susebox.jtopas.StandardTokenizer;
  import de.susebox.jtopas.StandardTokenizerProperties;
  import de.susebox.jtopas.ReaderSource;

  // class to hold main method
  public class ContentsExtractor {

    // Main method. Supply an HTML file name as argument
    public static void main(String[] args) {
      try {
        // setup the tokenizer properties
        TokenizerProperties props     = new StandardTokenizerProperties();

        props.setParseFlags(  Flags.F_NO_CASE 
                            | Flags.F_TOKEN_POS_ONLY 
                            | Flags.F_RETURN_WHITESPACES);
        props.setSeparators(null);
        props.addBlockComment("<", ">");
        props.addBlockComment("<HEAD>", "</HEAD>");
        props.addBlockComment("<!--", "-->");
        props.addSpecialSequence("&lt;", "<");
        props.addSpecialSequence("&gt;", ">");            
        props.addSpecialSequence("<b>", "");
        props.addSpecialSequence("</b>", "");
        props.addSpecialSequence("<i>", "");
        props.addSpecialSequence("</i>", "");
        props.addSpecialSequence("<code>", "");
        props.addSpecialSequence("</code>", "");

        // Case-sensitive HTML elements
        props.addSpecialSequence("&auml;", "ä", 0, Flags.F_NO_CASE);
        props.addSpecialSequence("&Auml;", "Ä", 0, Flags.F_NO_CASE);
        props.addSpecialSequence("&ouml;", "ö", 0, Flags.F_NO_CASE);
        props.addSpecialSequence("&Ouml;", "Ö", 0, Flags.F_NO_CASE);
        props.addSpecialSequence("&uuml;", "ü", 0, Flags.F_NO_CASE);
        props.addSpecialSequence("&Uuml;", "Ü", 0, Flags.F_NO_CASE);
        props.addSpecialSequence("&szlig;", "ß");

        // create the tokenizer
        Tokenizer           tokenizer = new StandardTokenizer(props);
        TokenizerSource     source    = new ReaderSource(args[0]);
        Token               token;
        int                 len;

        try {
          tokenizer.setSource(source);

          // tokenize the file and print roughly
          // formatted content to stdout
          len = 0;
          while (tokenizer.hasMoreToken()) {
            token = tokenizer.nextToken();
            switch (token.getType()) {
            case Token.NORMAL:
              System.out.print(tokenizer.current());
              len += token.getLength();
              break;
            case Token.SPECIAL_SEQUENCE:
              System.out.print((String)token.getCompanion());
              break;
            case Token.BLOCK_COMMENT:
              if (len > 0) {
                System.out.println();
                len = 0;
              }
              break;
            case Token.WHITESPACE:
              if (len > 75) {
                System.out.println();
                len = 0;
              } else if (len > 0) {
                System.out.print(' ');
                len++;
              }
              break;
            }   // end of switch
          }     // end of while
        } finally {
          // never forget to release resources and references
          tokenizer.close();
          source.close();
        }
      } catch (Throwable throwable) {
        // catch and print all exceptions and errors
        throwable.printStackTrace();
      }
    } // end of main
  } // end of class

Origin of JTopas

JTopas is a collection of Java modules that grew out of some Java experiments and solutions I worked on when nothing more important occupied my spare time. Since these modules can more or less easily be reused by others, I decided to make them available to the public.

Besides, I wanted to have my own Open Source project ;-)

The small pictures beside the headers and in the top frame of this page are drawings of a chestnut by Suse S. They are not exactly a logo, but look good ...

In the hope that JTopas will be useful,
Heiko Blau

Contact: webmaster

Last modified: Sun Nov 05 18:51:26 CEST 2004