ACOPOST - A Collection Of POS Taggers
News
- 2002/09/23: Renamed ICOPOST to ACOPOST and moved the package to the
  Sourceforge repository of open source projects. Released version 1.8.4,
  which contains a preliminary user's guide. The project urgently needs
  maintainers, admins, developers, active users, etc., since I (Ingo
  Schröder) won't have the time to maintain the package in the future.
  Please mail me at ixs at users.sourceforge.net.
- 2002/04/24: Released version 1.8.3BETA, which contains an additional
  tagger based on example-based techniques that is not documented on this
  page.
- 2001/08/21: Released version 0.9.0 (first public release).
- 2001/08/20: Cleaned things up.
- 2001/07/14: First public talk about ICOPOST.
- 2001/06/08: Web page started.
What's ACOPOST about?
Part-of-speech (POS) tagging is the task of assigning grammatical
classes to words in a natural language sentence. It's
important because subsequent processing stages (such as parsing)
become easier if the word class for a word is available.
Here's an English example of a tagged sentence taken from the
Wall Street Journal portion of the Penn Treebank:
Measures NNS
of IN
manufacturing VBG
activity NN
fell VBD
more RBR
than IN
the DT
overall JJ
measures NNS
. .
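The listing above has one token per line, followed by its tag. As a minimal sketch, the following Python snippet reads that layout into (word, tag) pairs. The function name `read_tagged` is hypothetical, and the one-token-per-line layout mirrors the example on this page; it is an assumption for illustration, not necessarily the input format the ACOPOST programs expect.

```python
def read_tagged(text):
    """Parse 'word TAG' lines into a list of (word, tag) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # The tag is taken to be the last whitespace-separated field.
        word, tag = line.rsplit(None, 1)
        pairs.append((word, tag))
    return pairs


# First five tokens of the example sentence shown above.
SAMPLE = """\
Measures NNS
of IN
manufacturing VBG
activity NN
fell VBD
"""

pairs = read_tagged(SAMPLE)
for word, tag in pairs:
    print(f"{word}/{tag}")
```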
ACOPOST is a set of freely available POS taggers that I modelled
after well-known techniques. The programs are written in C and
run under various UNIX flavors (and probably even under Windows).
ACOPOST currently consists of four taggers which are based on
different frameworks:
- Maximum Entropy Tagger MET:
  This tagger uses an iterative procedure to successively
  improve parameters for a set of features that help to
  distinguish between relevant contexts. It's based on a
  framework suggested by Ratnaparkhi [1998].
- Trigram Tagger T3:
  This kind of tagger is based on Hidden Markov Models in which
  the states are tag pairs that emit words, i.e., it's
  based on transitional and lexical probabilities. The
  technique was suggested by Rabiner [1990], and the
  implementation is influenced by Brants [2000].
- Error-driven Transformation-based Tagger TBT:
  Transformation rules are learned from an annotated corpus;
  each rule changes the currently assigned tag depending on
  triggering context conditions. The general approach, as well
  as its application to POS tagging, was proposed by Brill
  [1993].
- Example-based tagger ET:
  Example-based models (also called memory-based,
  instance-based or distance-based) rest on the assumption
  that cognitive behavior can be achieved by looking at past
  experiences that resemble the current problem rather than
  learning and applying abstract rules. They were
  suggested for NLP by Daelemans et al. [1996].
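To make the trigram idea from the list above more concrete, here is a small Python sketch of Viterbi decoding over a second-order Hidden Markov Model in which each state is a tag pair. It is not the T3 implementation: the function name `viterbi_trigram`, the `<s>` start padding, the toy probabilities, and the constant `floor` used for unseen events (a crude stand-in for real smoothing) are all assumptions made for illustration.

```python
import math


def viterbi_trigram(words, tags, trans, emit, floor=1e-6):
    """Return the most probable tag sequence under a second-order HMM.

    trans[(t1, t2, t3)] is P(t3 | t1, t2); emit[(tag, word)] is P(word | tag).
    Missing entries fall back to `floor` (an illustrative assumption).
    """
    start = "<s>"
    # Each Viterbi state is a tag pair (previous tag, current tag).
    best = {(start, start): 0.0}  # best log-probability of reaching each state
    backpointers = []
    for word in words:
        new_best, pointers = {}, {}
        for (t1, t2), score in best.items():
            for t3 in tags:
                p = trans.get((t1, t2, t3), floor) * emit.get((t3, word), floor)
                cand = score + math.log(p)
                if cand > new_best.get((t2, t3), -math.inf):
                    new_best[(t2, t3)] = cand
                    pointers[(t2, t3)] = (t1, t2)
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(best, key=best.get)
    result = []
    for pointers in reversed(backpointers):
        result.append(state[1])
        state = pointers[state]
    return result[::-1]


# Toy model with invented probabilities, not estimated from any corpus.
TAGS = ["DT", "NN", "VBD"]
TRANS = {
    ("<s>", "<s>", "DT"): 0.8,
    ("<s>", "DT", "NN"): 0.9,
    ("DT", "NN", "VBD"): 0.7,
}
EMIT = {
    ("DT", "the"): 0.6,
    ("NN", "index"): 0.01,
    ("VBD", "fell"): 0.02,
    ("NN", "fell"): 0.001,
}

decoded = viterbi_trigram(["the", "index", "fell"], TAGS, TRANS, EMIT)
print(decoded)  # ['DT', 'NN', 'VBD']
```

Working in log space avoids numeric underflow on longer sentences; a real tagger would also need smoothed transition estimates and a strategy for unknown words.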
A detailed description, an extensive evaluation and new
suggestions can be found in an accompanying technical report
[Schröder 2002].
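As a small illustration of the transformation-based approach described above, the following Python sketch applies an ordered list of rules of the form "change tag A to tag B if the preceding tag is C". The `Rule` type, the `apply_rules` function, and the single example rule are hypothetical; they are not taken from TBT, which may support other context conditions and a different application order.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    from_tag: str  # tag currently assigned
    to_tag: str    # tag to assign instead
    prev_tag: str  # triggering context: tag of the preceding word


def apply_rules(tags, rules):
    """Apply each rule in order to a sequence of initial tags."""
    tags = list(tags)
    for rule in rules:
        # Contexts are checked against the tags as they were before
        # this rule started to fire (a design choice of this sketch).
        snapshot = list(tags)
        for i in range(1, len(tags)):
            if snapshot[i] == rule.from_tag and snapshot[i - 1] == rule.prev_tag:
                tags[i] = rule.to_tag
    return tags


# "I want to run", with "run" initially mis-tagged as a noun.
initial = ["PRP", "VBP", "TO", "NN"]
rules = [Rule(from_tag="NN", to_tag="VB", prev_tag="TO")]

retagged = apply_rules(initial, rules)
print(retagged)  # ['PRP', 'VBP', 'TO', 'VB']
```

Learning the rules is the error-driven part and is not shown here: in each round, the candidate rule that removes the most errors on the annotated corpus would be selected and appended to the list.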
Further information
The project page at Sourceforge can be reached at
http://sourceforge.net/projects/acopost/ where the
latest releases can be found.
Mailing lists are available for announcements, for developers
and for users at
http://sourceforge.net/mail/?group_id=62355.
References
Thorsten Brants. 2000. TnT - a statistical part-of-speech
tagger. In Proceedings of the Sixth Applied Natural Language
Processing Conference (ANLP-2000), Seattle, WA, USA.
Eric Brill. 1993. Automatic grammar induction and parsing free
text: A transformation-based approach. In Proceedings of the
31st Annual Meeting of the ACL.
Walter Daelemans, Jakub Zavrel, Peter Berck & Steven Gillis.
1996. MBT: A memory-based part of speech tagger-generator. In
Eva Ejerhed & Ido Dagan, eds., Proceedings of the Fourth
Workshop on Very Large Corpora, pages 14-27.
Ingo Schröder. 2002. A Case Study in Part-of-Speech tagging
Using the ICOPOST Toolkit. Technical report
FBI-HH-M-314/02. Department of Computer Science, University of
Hamburg. Available from
http://nats-www.informatik.uni-hamburg.de/~ingo/papers/.
Lawrence R. Rabiner. 1990. A tutorial on hidden Markov models
and selected applications in speech recognition. In Alex Waibel
& Kai-Fu Lee, eds., Readings in Speech Recognition.
Morgan Kaufmann, San Mateo, CA, USA, pages 267-290. See also
Errata at
http://www.media.mit.edu/~rahimi/rabiner/rabiner-arrata/.
Adwait Ratnaparkhi. 1998. Maximum Entropy Models for Natural
Language Ambiguity Resolution. Ph.D. thesis, University
of Pennsylvania.