The German Journal

Wrestling with Human Cognition…

David Rosson
12 min read · Sep 9, 2018

If you’re reading this and you’re not me, you’re probably doing it wrong. There are articles that, though also read by almost no one else, were at least intended for a broader audience.

I’m making a web app for myself to learn German. The reason for writing all this down is to remember what I’m doing. Looking at all the images and notes and sketches collected over the years, I used to be a creative man, someone full of ideas and optimism and aspirations; now I’m a husk of a man who can’t even remember what I had for lunch yesterday.

When I read a README written by my past self, it really helps.

That’s why I’m gonna write things down.

So what’s on the agenda? Cleaning up data. Currently, each entry looks relatively messy: look at the misaligned IPA, the unstyled lists…

The scraped HTML source looks even messier, with lots of levels and lots of unexplained attributes. Today I spent probably an hour looking for a piece of A4 paper on which I had annotated the structure of these entries with colourful highlighters, to no avail, though I did find some other things that could never be found when they were actually being searched for…

All those notes I thought I would “come back to someday”, in Evernote, in email attachments, in photos, in scans, in Pocket, in bookmarks, become practically impossible to find when you want to find something…

Speaking of being organised, I used to have a whole setup built, following David Allen’s system, 43 folders and colour-coded filing and whatnot…

So the first thing — and this was an idea from before but I could not remember the details — is to build a visualiser:

  • It has a web server that reads local files (raw lexical HTML)
  • I can browse through different files quickly, e.g. using the arrow keys
  • The page will request the file content, then inject it into the document
  • There’s common CSS styling

I then write the CSS according to what I conceive the file structure to be, and transform the files to prune away the exceptions…

Lexical Visualiser

More ideas are coming back from memory; for example, I wanted to use pseudo-elements to add tiny labels to the elements displayed on the page.
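Something like this, as a minimal sketch: the .pos class does exist in the scraped entries, while the label text and styling here are placeholders.

.pos::before {
  content: 'pos ';
  font-size: 0.6em;
  color: #999;
}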

Good news! Found the sheet of annotations when casually clearing a benchtop…

Add a “Hello World” Express server:

  • Serve a static directory, public
  • The web interface, the UI code for controls, and the stylesheet live in this public directory
  • Add a placeholder GET endpoint (see the sketch below)
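A minimal sketch of that server; the port and the endpoint name are placeholders:

const express = require('express');

const app = express();

// The UI code for controls and the stylesheet live in public/
app.use(express.static('public'));

// Placeholder GET endpoint
app.get('/ping', (req, res) => res.send('pong'));

app.listen(3000, () => console.log('Visualiser at http://localhost:3000'));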

The UI-Server cycle:

  • Add a key-down listener on the page
  • When a certain key is pressed, the page makes an Ajax call to an endpoint defined in Express
  • Express reads data from the file system, and sends back the response

Example:

  • User presses ‘Right Arrow’ on front-end
  • This fires an Ajax call to ‘/next’ on the host
  • Express has a handler for ‘/next’, which sends back the payload (sketched below)
  • Front-end renders the payload (HTML fragment)
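A sketch of that round trip, continuing the server above; the data directory, the #entry container, and the cursor logic are all assumptions, and fetch stands in for the Ajax call. Server side:

const fs = require('fs');
const path = require('path');

const files = fs.readdirSync('data').filter((f) => f.endsWith('.html'));
let current = 0;

// '/next' advances the cursor and sends back the next HTML fragment
app.get('/next', (req, res) => {
  current = (current + 1) % files.length;
  res.send(fs.readFileSync(path.join('data', files[current]), 'utf8'));
});

And on the front-end:

// In public/: listen for the arrow key, then inject the fragment
document.addEventListener('keydown', async (event) => {
  if (event.key !== 'ArrowRight') return;
  const response = await fetch('/next');
  document.querySelector('#entry').innerHTML = await response.text();
});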

Adding IPA Font

Add Charis SIL font:

@font-face {
  font-family: 'Charis';
  src: url('http://site/fonts/charis.ttf');
}
.ipa {
  font-family: 'Charis', serif;
}

Cleaning Up Headwords

  • What to do with numbered homonyms?
  • Gender and regional variants (inside the headline block)?
  • Grammar labels (e.g. ‘separable’)?
  • “Word Forms” outside the content block
  • “Word Forms” inside the content block — and sometimes IPAs
  • Labels inside .entry-content .lexical-group

Oh, what a mess…

  • The dangling text node of the H2 should be moved to a separate element
  • “Double-dangling” orthography forms

There can be more than one “.lexical-group” DIV… this possibly applies more to verbs than to nouns.

Next steps:

  • Change the class of the first H2 .orth block to “canonical”
  • Replace the exclamation mark with an “interjection” label
  • Break up the headline block at <em class="italics"> or </em>, then wrap the parts into sibling elements (a sketch of these transforms follows)
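A sketch of the first two transforms with cheerio, assuming each file is loaded as a string (cheerio is one way to do this; whatever I actually wire up may differ):

const cheerio = require('cheerio');

function transform(html) {
  const $ = cheerio.load(html);

  // 1. The first H2 .orth block becomes the canonical headword
  $('.entry h2.orth').first().addClass('canonical');

  // 2. Swap a dangling exclamation mark for an "interjection" label
  $('.entry h2.orth')
    .contents()
    .each((i, node) => {
      if (node.type === 'text' && node.data.includes('!')) {
        $(node).replaceWith(
          node.data.replace('!', '') + '<span class="pos">interjection</span>'
        );
      }
    });

  return $.html();
}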

Some useful notes from 2015:

Now it’s as hard to remember a new word as it is to recall the inspirations I had in mind years ago…

Slideshow app: http://de.sayberlin.com

I find it hard to “learn” with this Slideshow mode. Why? It’s good as a prompt, as a reminder, a shorthand for something you already know.

  • What does it mean to remember? Connecting X to Y; well, X may have a dozen forms, inflected, separated, in various contexts of varying quality… and Y itself is a long list of senses, a slippery nebula of meaning.
  • What does it mean to learn? First I would need a good grasp of the nebula of the target (the meaning object). Then I can connect it to the symbol (a word form).
  • The short form English gloss really doesn’t do anything for me.

Inspecting the Headline Element

I added a dry-run mode in the transform task runner.

Apparently, all files have exactly one ‘.entry h2.orth’.

The next step is to check the variety of content nodes (including dangling text nodes, blank nodes) out there.

achtundsechziger_1.html
1 Achtundsechziger
2 <div class="ipa" data-audio="de_achtundsechziger.mp3" data-orth="Achtundsechziger">axt|ʊntˈzɛçtsɪɡɐ</div>
3
4 <span class="pos">masculine noun</span>
5 <em class="italics"> or </em>
6 Achtundsechzigerin
7 <div class="ipa">-ərɪn</div>
8
9 <span class="pos">feminine noun</span>
10 <em class="italics"> or </em>
11 68er
12 <span class="pos">masculine noun</span>
13 ,
14 <span class="pos">feminine noun</span>

Maybe the next thing here is to wrap all the text nodes. No more dangling nodes! This was quite a success.
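The wrapping itself, sketched in the same cheerio setting (the orth-text class name is made up):

// Wrap every non-empty dangling text node in its own span
$('.entry h2.orth')
  .contents()
  .each((i, node) => {
    if (node.type === 'text' && node.data.trim()) {
      $(node).replaceWith('<span class="orth-text">' + node.data + '</span>');
    }
  });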

Hierarchical Ambiguity

There are at least two kinds of complications:

One is when there are multiple variants. The variants are visually separated by an “or”:

<em class="italics"> or </em>

Also as an alias:

<em>,&nbsp;&nbsp;</em>

Yet this separator, the headwords (formerly dangling text nodes), and their grammar / register / regional labels all reside as siblings on the same level!

The second is much rarer, but a bigger problem:

<em class="hi">or</em>

It occurs in the middle of a phrase. For one thing, you don’t know which two interchangeable parts the “or” connects. Then, the text nodes that should otherwise form a single string now straddle this special separator… which itself is not a plain string; why not add an onclick attribute while you’re at it?!

The simpler solution is to just replace it with ‘/’. The replacement, however, should run before the node wrapping, so the phrase remains a whole string.
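In the same cheerio setting, that replacement is a one-liner, run ahead of the wrapping pass:

// Before wrapping text nodes: collapse the mid-phrase separator
$('em.hi').replaceWith(' / ');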

Odd Exclamation Marks!

For a while I thought these exclamation marks appear in interjection phrases (indeed a few do), but in most cases they make no sense:

Why such excitement? It’s just a profession.

Tidying Up Labels

Look at this special madness:

Word Types

Today, by some random drifting of thought, I began to wonder what a good ratio is for learning different types of words. Let’s say I’m aiming for 20K lemmas: how many of these should be nouns (there’s a bunch of compounds, cognates, proper nouns, etc.), how many verbs, how many adjectives…

With a quick search, Erwin Tschirner’s 4K list has:

  • 1,782 nouns
  • 1,068 verbs
  • 660 adjectives and 250 adverbs (no idea why…)

That’s roughly 45% nouns, 27% verbs, and 17% adjectives…

But the size of this list is 4K. As the size increases, you may start to run out of verbs; there are only so many… Nouns just fluff and puff up with compounding, at no or minimal extra learning effort… Adjectives can be useful for describing stuff, and the category also grows by extending derivatives; still, there are only so many shades of commonly effable distinctions along the spectra of qualia…

Verbs of which I have collected (and cleaned up) conjugation data: 1,853.

Looks like 2K verbs is a good place to stop — wait, I used to have a big bad book of German verbs, 555 verbs featured — anyway, even 1.5K would do.

Grab the synonym graph. Dexterity in navigating nuances is more useful than knowing obscure words.

2,000 proto-lemmas would be a good target for nouns, that is, if words are recursively reduced to a “hub concept”, e.g. “imagination, imaginary, etc.” can be reduced to “imagine” (but not “image”), then these core numbers can fan out to a very large repertoire.

Adjectives are an interesting topic. One day I’d like to build a large collection of “scales”, e.g. “freezing, chilly, cool, warm, hot, scorching”, then at each parallel it can also expand sideways to synonyms (“nice, friendly, affable”).

At the very basics, there are a bunch of contrast pairs. These could be the core, non-derivative adjectives, probably around 500 of them or more. The rest you can make up from nouns (-haft) and verbs (-d).

Phonotactic Training

New idea! I have a dataset of IPA transcriptions:

ˈapɡəʃiːdn
ˈapɡəʃlaft
ˈapʃlafn
ˈapɡəʃlaːɡn
ˈapɡəʃmakt
ˈapɡəzeːən
ˈapɡəʃtɔrbn
ˈapɡəʃtraːft
ˈapɡəʃtʊmpft

It so happens that these also correspond to audio files (hopefully human recordings).

Here’s the idea:

  • Go through the transcriptions and find all 2-grams (and 3-grams)
  • For each n-gram, find the most frequent words
  • Use the audio of these words as examples for phonotactics
  • Special-case handling is needed for stress (ˈ) and quantity (ː); actually it’s better to define plausible onsets, nuclei, and codas, including diphthongs, clusters, etc.
  • Sometimes the boundary, and therefore the grouping, won’t make sense, e.g. in the separable verbs shown above (a counting sketch follows this list)
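A minimal counting sketch in Node, where transcriptions stands for the list above; the stress and length marks are simply stripped here rather than properly modelled:

// Count character 2-grams across all transcriptions
const counts = new Map();
for (const t of transcriptions) {
  const symbols = [...t.replace(/[ˈˌː]/g, '')];
  for (let i = 0; i < symbols.length - 1; i++) {
    const gram = symbols[i] + symbols[i + 1];
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
}

// Most frequent 2-grams first
const ranked = [...counts.entries()].sort((a, b) => b[1] - a[1]);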

Probabilistic Gender

At some point I’d like to remember the gender of most common nouns. Once you go beyond 4K down the frequency list, a lot of the nouns are compounds, rather than entirely new nouns. Can the learning space be condensed?

  • In terms of meaning, for example, ‘wineglass’ really is just a ‘glass’ for ‘wine’. But, the meaning of ‘carpet’ is not a type of ‘pet’ to do with ‘cars’.
  • On the other hand, predicting gender should be most reliable when it’s based on compounds. In all reasonable universes, the gender of ‘Olympiamannschaft’ should be deterministic from ‘Mannschaft’.

I was thinking, technically, how to detect compounds and analyse them into their components; then I would only have to remember the gender of the stem noun, rather than a dozen of its derivatives…

I was considering many ways of computing on substrings. Then I thought, why stop there? Let’s just run n-grams from the right-hand side:

  • Filter for a list of nouns from the 40K list
  • Each noun’s gender is known from the dataset
  • For each noun, map a set of n-grams from the right-hand side, and
  • Register its gender in a hash of n-grams
  • The n-grams include up to whole words, skipping the hassle of a separate method for detecting compounds
  • For each n-gram, get stats on the probability of each gender (sketched below)
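A sketch of the tallying, assuming nouns is an array of { word, gender } records drawn from the 40K list, with gender being 'm', 'f', or 'n':

// Suffix n-gram -> counts per gender
const stats = new Map();
for (const { word, gender } of nouns) {
  const w = word.toLowerCase();
  for (let len = 1; len <= w.length; len++) {
    const suffix = w.slice(-len); // n-grams from the right-hand side
    const tally = stats.get(suffix) || { m: 0, f: 0, n: 0 };
    tally[gender] += 1;
    stats.set(suffix, tally);
  }
}

// e.g. how confidently does '-schaft' predict feminine?
const s = stats.get('schaft');
console.log(s.f / (s.m + s.f + s.n));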

If in 99%+ of cases, ‘-schaft’ is a feminine noun, then we have found a high-confidence ending for predicting gender. Then it goes down the slope from “stems” to “rules” to “lores”, with probabilities for ‘-aft’, ‘-ft’, or just ‘-t’.

Then we gather the most prolific (with a large number of instances) high-confidence (.95, .99, etc.) endings, and put them into a card deck.

Add some red/gray/green colour-coding; add some swiping UI; add local storage for recording attainment…

At some point, we could also add “tags” to nouns, for example, sometimes the gender can be predicted by thematic type, e.g. “season”, “month of the year”, “currency”, “weather”, “scientific unit”.

I also entertain a vague theory that speakers have vague, internal, implicit theories for categories (the very definition of grammatical gender). Some languages have a handful of genders. Let’s say in the imaginary world, nouns have hundreds of genders — these then get collapsed into a handful — then further into 3 or 2.

My vague theory is that people still make up vague theories in their minds about what the gender of a noun is likely to be. For example, I have some theories that containers are likely to be feminine, and that a “thing” or “object” of interchangeable quality (as in a commodity or material) is likely to be neuter… How does this work in practice? It’s going to be like horoscopes: we’ll just have to brainstorm up a lot of these “theoretical categories”, assign them to nouns, and see what sticks.

There’s also a theory that information with an overarching (if vague) theory behind it is a lot easier to remember… Humans are a theorising species.

Word Types

I got some new data to provide usage examples, which solves the “drawing a blank about Y” problem… The 34K entries are dominated by nouns; I can browse miles and miles without seeing another type. But thanks to the larger overall numbers, it does have other types:

  • Nouns, plurals, proper names, numbers: 23,174
  • Strong verbs: 837
  • Weak verbs: 3,496
  • Adjectives: 5,209 (excluding adj. nouns…)
  • Adverbs: 568

Soft Hyphens?!

Today I discovered the bizarre world of soft hyphens.

Only, a soft hyphen is normally not visually detectable…
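Once you know it’s there, stripping it is a one-liner (U+00AD is the soft hyphen):

// Remove invisible soft hyphens from the scraped text
const cleaned = text.replace(/\u00AD/g, '');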

Declensions

If by this point you still don’t think German is more complex than English and therefore harder to learn (amongst other reasons), you are just deluding yourself, you are flat-earth, dinosaurs-on-ark delusional.

Back to Noun Gender

Now I have a relatively clean set of data, with a large number of nouns (including compounds) and the gender data. Maybe it’s time to experiment with the probabilistic n-gram…

Normalising Multi-level Structures

Dangling text nodes again…
List within a list

Lexical Categories

Also listed as “Part of Speech” (PoS)… The older dataset has a madly large number of these — it’s actually just messy data:

1st and 3rd person plural
1st person singular present tense of verb
2nd person plural of present
2nd person singular of present
3rd person singular present tense of verb
abbreviation for
accusative
adjective
adverb
auxiliary verb
comparative
conjunction
contraction of
dative
definite article
demonstrative pronoun
exclamation
feminine noun
genitive
imperative
impersonal reflexive verb
impersonal verb
indefinite article
indefinite pronoun
interrogative adjective
interrogative adverb
interrogative pronoun
intransitive verb
masculine noun
modifier
neuter noun
number
past participle of verb
past tense
personal pronoun
plural
plural noun
possessive pronoun
prefix
preposition
pronoun
reflexive pronoun
reflexive verb
relative adverb
relative pronoun
singular imperative
suffix
suffix forming adjectives
superlative
transitive impersonal verb
transitive verb
unknown

Exit Criteria

This data wrangling is dragging on for too long. I need to define when to stop and perhaps move on to starting the next project. The goal is to have an app that I can use on a daily basis. That means a simple interface.

For the data wrangling, the exit point is to get the data “ready for extraction”. That essentially means “flattening” the sense list into a uniform structure, where the leaf nodes are plain text instead of annotated HTML.

That means no more inline special markers:

  • Decide what should form a separate line or line-group, e.g. a phrase and its translation, or sense and translation, or context and translation…
  • Extract the markers as attributes of that line (sense, register, context)
  • Use various methods to break up the LI into multiple lines (see the sketch below)

Ready to be flattened!
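A sketch of that extraction, with made-up selectors (.sense-list li, span.register), since the real class names vary:

// Pull inline markers out as attributes, leaving plain-text lines
const lines = [];
$('.sense-list li').each((i, li) => {
  const item = $(li);
  const register = item.find('span.register').text() || undefined;
  item.find('span.register').remove();
  lines.push({ text: item.text().trim(), register });
});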

Now, Some Digression

“Even if you find it unusual, the c-test is used worldwide for placement purposes and has also been substantially researched (see www.c-test.de).”

Well, you can hit people with a shelf of dissertations on the subject; still, no normal person speaks this way:

Words are not exactly Gestalt triangles, and even if they were, cutting the triangle in the middle would defeat the effect. Indeed, apt readers need only the bookends or edge shapes (with the middle scrambled) to skim text and get the meaning, especially when not trying too hard… but leading clozes obviously don’t work the same way. Oh well, leaving the coda would reveal affixes; I imagine these researchers would also claim they want to test grammar.

Ready for Extraction?

Now everything that can be flattened into a plain string is on its own line… almost.
