Self-Test 10

Wed, Jan. 26, 2011

File I/O

Before you start:

Take a close look at this syntactically annotated corpus: tuebadz_1-50.utf8.

For now we are only interested in the lines with words on them, that is, all lines that do not start with "%%" or "#". Each of these lines has a word followed by a tag that indicates the part of speech (NN for normal noun, ART for article, etc.), and then a string that gives the morphological values for the word.

The morphological value for a normal noun (NN) is a 3-character string:

The character in the first position of the morphological value represents the case ('n' for nominative, 'g' for genitive, etc.).
The character in the second position of the morphological value represents the number ('s' for singular,...).
The character in the third position of the morphological value represents the gender ('m' for masculine,...).

A '*' in any position indicates that the value for that position is underspecified (i.e. unknown).


NN - morphological values

position  feature  possible values
1         case     n (nominative), g (genitive), d (dative), a (accusative), * (underspecified)
2         number   s (singular), p (plural), * (underspecified)
3         gender   m (masculine), f (feminine), n (neuter), * (underspecified)
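
The three positions can be read off a morphological value with plain character access. The following is a minimal sketch in Java, not part of the original exercise: the class name NounMorph and its methods are hypothetical names chosen for illustration, and it assumes the value is always exactly three characters long for NN.

```java
/** Illustrative holder for the morphological value of a normal noun (NN). */
final class NounMorph {
    final char nounCase; // position 1: n, g, d, a or *
    final char number;   // position 2: s, p or *
    final char gender;   // position 3: m, f, n or *

    private NounMorph(char nounCase, char number, char gender) {
        this.nounCase = nounCase;
        this.number = number;
        this.gender = gender;
    }

    /** Splits a 3-character value such as "nsm" into its three features. */
    static NounMorph parse(String morph) {
        // Assumption: an NN value always has exactly three characters.
        if (morph == null || morph.length() != 3) {
            throw new IllegalArgumentException("expected 3 characters: " + morph);
        }
        return new NounMorph(morph.charAt(0), morph.charAt(1), morph.charAt(2));
    }

    /** True only if the number is explicitly marked as singular. */
    boolean isSingular() {
        return number == 's';
    }

    /** True if any of the three positions is '*'. */
    boolean hasUnderspecifiedFeature() {
        return nounCase == '*' || number == '*' || gender == '*';
    }
}
```

For example, NounMorph.parse("nsm") describes a nominative, singular, masculine noun, while NounMorph.parse("*pf") describes a plural feminine noun whose case is underspecified.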


Exercise:

Write a program SearchCorpus that takes two command-line arguments: the file name of the corpus and the name of the destination file. The program should find all of the singular nouns in the input corpus and print them to the destination file.

Your output should look like this.
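
One possible shape for such a program is sketched below in Java. It is a sketch under several assumptions, not a reference solution: the fields on a word line are assumed to be separated by whitespace, "singular noun" is taken to mean a line tagged NN whose second morphological character is 's', and the output is assumed to be one word per line (the expected output file is not reproduced here, so the real format may differ). The helper name singularNouns is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class SearchCorpus {

    /** Returns the words of all lines tagged NN whose number feature is 's'. */
    static List<String> singularNouns(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            // Skip the lines without words: comments ("%%"), markup ("#"), blanks.
            if (line.startsWith("%%") || line.startsWith("#") || line.trim().isEmpty()) {
                continue;
            }
            // Assumption: word, tag and morphology are separated by whitespace.
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 3) {
                continue; // not a complete word line
            }
            String word = fields[0];
            String tag = fields[1];
            String morph = fields[2];
            // Position 2 of the 3-character NN value holds the number.
            if (tag.equals("NN") && morph.length() == 3 && morph.charAt(1) == 's') {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: java SearchCorpus <corpus file> <destination file>");
            System.exit(1);
        }
        // The corpus file name ends in .utf8, so it is read as UTF-8.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        // Assumed output format: one singular noun per line.
        Files.write(Paths.get(args[1]), singularNouns(lines), StandardCharsets.UTF_8);
    }
}
```

Under these assumptions the program could be run as `java SearchCorpus tuebadz_1-50.utf8 singular.txt`, where singular.txt is a placeholder name for the destination file. Keeping the filtering in a separate method that works on a list of lines makes it easy to try out on a few hand-written lines before running it on the whole corpus.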