Resources

This site provides two kinds of resources: 1. lexical information induced from lexicalised PCFGs, and 2. human associations to German verbs. Each resource is explained below, including examples and references.

Please note that you have access to the examples only. The full resources are freely available for education, research and other non-commercial purposes. Please contact me to obtain access to the complete sources.
1. Lexical Information induced from Lexicalised PCFGs

Head-Lexicalised Probabilistic Context-Free Grammars (HeadLex-PCFGs) are a lexicalised extension of PCFGs that incorporate lexical heads into the grammar rules, cf. Charniak (1997) and Carroll and Rooth (1998). The core of a HeadLex-PCFG is a context-free grammar with head-marking on the children. The parameters of the probabilistic version of the context-free grammar (for the unlexicalised PCFG, for a lexicalisation bootstrapping step, and for the lexicalised HeadLex-PCFG) are then estimated in an unsupervised training procedure, using the Expectation-Maximization algorithm (Baum, 1972). The algorithm iteratively improves the model parameters by alternately assessing frequencies and estimating probabilities.

We used the statistical parser LoPar to perform the parameter training. The trained grammar model provides lexicalised rules and syntax-semantics head-head co-occurrences, as an empirical resource for inducing quantitative lexical properties at the syntax-semantics interface. The lexical information can be used for lexical acquisition and for modeling linguistic phenomena. This page provides lexical information for German and for English.

Lexical Acquisition from the Huge German Corpus (HGC)

The German HeadLex-PCFG was trained on 35 million words
of the Huge German Corpus (HGC), a collection of newspaper
corpora from the 1990s. We provide the unlexicalised and
lexicalised grammar files for parsing, empirical word
frequencies, and lexical data on various linguistic
phenomena.
... for verbs:
Sabine Schulte im Walde. Experiments on the Automatic Induction of German Semantic Verb Classes [ps.gz/bib]. PhD Thesis. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, June 2003. Published as AIMS Report 9(2). [Chapter 3]

Lexical Acquisition from the British National Corpus (BNC)

The English HeadLex-PCFG was trained on
approx. half of the BNC, 50 million words. The
resulting grammar model was applied to obtain Viterbi
parses for the whole corpus, 117 million words. From
the Viterbi parses we then extracted lexical
information about verbs, subcategorisation frames and
arguments.
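
As a rough illustration of how lexical information of this kind can be derived, the following Python sketch counts verb-frame pairs and converts the counts into relative frequencies per verb. The input format (a list of verb-frame tuples), the frame labels and the function name `frame_distributions` are assumptions made for this example only; they do not reflect the actual LoPar output or the format of the distributed files.

```python
from collections import Counter, defaultdict


def frame_distributions(observations):
    """Return {verb: {frame: relative frequency}} from (verb, frame) pairs."""
    counts = defaultdict(Counter)
    for verb, frame in observations:
        counts[verb][frame] += 1

    distributions = {}
    for verb, frame_counts in counts.items():
        total = sum(frame_counts.values())
        distributions[verb] = {
            frame: count / total for frame, count in frame_counts.items()
        }
    return distributions


# Hypothetical (verb, frame) pairs, standing in for what might be read
# off Viterbi parses; the frame labels are invented for illustration.
observations = [
    ("give", "NP-NP"),
    ("give", "NP-PP"),
    ("give", "NP-NP"),
    ("sleep", "INTRANS"),
]

dist = frame_distributions(observations)
print(dist["give"])
```

The relative frequencies for each verb sum to one, so they can be read as a simple maximum-likelihood estimate of a frame distribution given the verb.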
Sabine Schulte im Walde. Automatic Semantic Classification of Verbs According to their Alternation Behaviour [ps.gz/bib]. Diplomarbeit. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 1998. [mainly: Section 2.1 and Appendix A]

2. Human Associations to German Verbs

The data collection was performed as a web experiment, which asked native speakers to provide associations to German verbs.

Material: 330 verbs were selected for the experiment. They were drawn from a variety of semantic classes including verbs of self-motion (e.g. gehen 'walk', schwimmen 'swim'), transfer of possession (e.g. kaufen 'buy', kriegen 'receive'), cause (e.g. verbrennen 'burn', reduzieren 'reduce'), experiencing (e.g. hassen 'hate', überraschen 'surprise'), communication (e.g. reden 'talk', beneiden 'envy'), etc. The stimulus verbs were divided randomly into 6 separate experimental lists of 55 verbs each. The lists were balanced for class affiliation and frequency ranges (0, 100, 500, 1000, 5000), such that each list contained verbs from each broadly defined semantic class and had an equivalent overall verb frequency distribution.

Data: 299 native German speakers participated in the experiment, between 44 and 54 for each data set. In total, we collected 79,480 associates distributed over 39,254 different response types. Each trial elicited an average of 5.16 associate responses with a range of 0-16. We pre-processed all data sets in the following way: For each stimulus word, we quantified over all responses in the experiment, disregarding the order in which associates were provided and, for noun stimuli, the presentation type of the questionnaire. The result is a frequency distribution for the stimulus words, providing frequencies for each response type. The responses were not distinguished according to polysemic senses of the stimuli.

Example: klagen
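
The pre-processing step described under Data above (pooling all responses per stimulus and ignoring their order) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the trial format, the function name `association_frequencies` and the sample responses are invented for illustration and are not taken from the collected data.

```python
from collections import Counter, defaultdict


def association_frequencies(trials):
    """Pool responses per stimulus, ignoring participant and response order."""
    freq = defaultdict(Counter)
    for stimulus, responses in trials:
        # Counter.update adds one count per listed associate.
        freq[stimulus].update(responses)
    return freq


# Hypothetical trials: (stimulus verb, ordered associates from one participant).
# The associates are made up for this example, not real experiment data.
trials = [
    ("klagen", ["Gericht", "jammern", "Anwalt"]),
    ("klagen", ["jammern", "Gericht"]),
    ("klagen", []),  # a trial may elicit no response at all
    ("gehen", ["laufen", "Weg"]),
]

freq = association_frequencies(trials)
for response, count in freq["klagen"].most_common():
    print(response, count)
```

In this sketch, the sum of the counts for a stimulus is its number of associate tokens, and the number of distinct keys is its number of response types.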