Self-Test 10

Wed, Jan. 26, 2011

File I/O

Before you start:

Take a close look at this syntactically annotated corpus tuebadz_1-50.utf8.

The first line starts with "%%", which indicates a comment, and has heading names for the columns (word, tag, morph,...).
The second line starts with:

The next lines, which are the ones we are interested in, are of the form:

Then there are some lines that start with

Followed by the End Of Sentence indicator:

The file then continues with #BOS / #EOS pairs till the end.

For now we are only interested in the lines with words on them, that is, all lines that do not start with "%%" or "#". Each of these lines has a word, followed by a tag that indicates the part of speech (NN for normal noun, ART for article, etc.), and then a string that gives the morphological values for the word.
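The filtering rule above can be sketched in Java as follows. This is only a minimal illustration: the class name WordLineDemo, its helper methods, and the sample lines are made up for this example (real corpus lines may carry more columns), and skipping blank lines is an extra assumption not stated above.

```java
class WordLineDemo {
    // A word line is any non-blank line that starts with neither "%%" nor "#".
    // (Skipping blank lines is an assumption added for safety.)
    static boolean isWordLine(String line) {
        return !line.trim().isEmpty()
                && !line.startsWith("%%")
                && !line.startsWith("#");
    }

    // Split on runs of whitespace. Trimming first avoids an empty
    // first field when the line begins with a space or tab.
    static String[] fields(String line) {
        return line.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // Made-up sample lines, not copied from the corpus.
        String[] samples = {
            "%% word tag morph",  // comment line
            "#BOS",               // sentence marker (real lines may carry more text)
            "Haus\tNN\tnsn",      // word, tag, morphological value
            "#EOS"
        };
        for (String line : samples) {
            if (isWordLine(line)) {
                String[] f = fields(line);
                System.out.println("word=" + f[0] + " tag=" + f[1] + " morph=" + f[2]);
            }
        }
    }
}
```

With the sample data, only the third line passes the filter, so the loop prints a single line for "Haus".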

The morphological value for a normal noun (NN) is a 3-character string:

The character in the first position of the morphological value represents the case ('n' for nominative, 'g' for genitive, etc.).
The character in the second position of the morphological value represents the number ('s' for singular,...).
The character in the third position of the morphological value represents the gender ('m' for masculine,...).

A '*' in any position indicates that the value for that position is underspecified (i.e. unknown).

NN - morphological values
position	feature	possible values
1	case	n (nominative) g (genitive) d (dative) a (accusative) * (underspecified)
2	number	s (singular) p (plural) * (underspecified)
3	gender	m (masculine) f (feminine) n (neuter) * (underspecified)
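To make the table concrete, here is a small Java sketch that spells out a 3-character NN value position by position. The class name NounMorph and its methods are hypothetical helpers invented for this illustration; they are not part of the exercise, and returning "invalid" for unexpected characters is an assumption.

```java
class NounMorph {
    // Position 1: case
    static String caseName(char c) {
        switch (c) {
            case 'n': return "nominative";
            case 'g': return "genitive";
            case 'd': return "dative";
            case 'a': return "accusative";
            case '*': return "underspecified";
            default:  return "invalid";
        }
    }

    // Position 2: number
    static String numberName(char c) {
        switch (c) {
            case 's': return "singular";
            case 'p': return "plural";
            case '*': return "underspecified";
            default:  return "invalid";
        }
    }

    // Position 3: gender
    static String genderName(char c) {
        switch (c) {
            case 'm': return "masculine";
            case 'f': return "feminine";
            case 'n': return "neuter";
            case '*': return "underspecified";
            default:  return "invalid";
        }
    }

    // Spell out all three positions of an NN morphological value.
    static String describe(String morph) {
        if (morph.length() != 3) {
            throw new IllegalArgumentException("expected 3 characters: " + morph);
        }
        return caseName(morph.charAt(0)) + " "
                + numberName(morph.charAt(1)) + " "
                + genderName(morph.charAt(2));
    }

    public static void main(String[] args) {
        System.out.println(describe("nsm")); // nominative singular masculine
        System.out.println(describe("*pf")); // underspecified plural feminine
    }
}
```

Note that 'n' means nominative in the first position but neuter in the third, so the position has to be checked along with the character.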

Exercise:

Write a program SearchCorpus that takes two command-line arguments: the file name of the corpus and the name of the destination file. The program finds all singular nouns in the input corpus and prints them to the destination file.

Use BufferedReader (with line.split("\\s+")) and BufferedWriter.
Ask the user before overwriting a file.
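One possible way to handle the overwrite question is sketched below. It is not the reference solution and covers only the confirmation step: the class name OverwriteCheck, the method mayWrite, the prompt wording, and the placeholder output line are all assumptions made for this illustration. The writer is opened with an explicit UTF-8 charset because the corpus file is UTF-8.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class OverwriteCheck {
    // Returns true if dest does not exist yet, or if the user answers "y".
    static boolean mayWrite(File dest, BufferedReader console) throws IOException {
        if (!dest.exists()) {
            return true;
        }
        System.out.print(dest.getName() + " already exists. Overwrite? (y/n) ");
        String answer = console.readLine(); // null if the input has ended
        return answer != null && answer.trim().equalsIgnoreCase("y");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java OverwriteCheck <destination>");
            return;
        }
        File dest = new File(args[0]);
        BufferedReader console =
                new BufferedReader(new InputStreamReader(System.in));
        if (!mayWrite(dest, console)) {
            System.out.println("Nothing written.");
            return;
        }
        // try-with-resources closes (and flushes) the writer automatically.
        try (BufferedWriter out =
                Files.newBufferedWriter(dest.toPath(), StandardCharsets.UTF_8)) {
            out.write("placeholder line"); // real output would go here
            out.newLine();
        }
    }
}
```

Passing the console reader into mayWrite keeps the check separate from the file-writing code, which makes it easy to try out with canned input.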

Your output should look like this.