This is an all-at-once design document for the English grammar currently
shipping in BibleTrans. To understand how we got here, see the "English
Grammar Step by Step" tutorial in the BibleTrans on-line documentation
(linked from the Welcome screen).
As I begin to write this, there are large parts of this grammar -- and apparently of the translation engine itself -- whose workings I no longer remember. I have marked these with a "Sad Mac" icon to remind me to come back and study them in more detail, perhaps by running a translation and watching the log for what happens. By the time you read this, all such icons should be gone and replaced with corrected descriptions.
First we start with the general structure of a BibleTrans grammar, without reference to any particular language.
Grammar rules are placed in the export file in the order they must be found upon import, which depends mostly on which rules must be seen before they can be mentioned in other rules. Rules may be created and modified in the editor in any order, but some rules cannot be completed until the rules they depend on have been properly defined. This is the file order:
Global Variables
Categories (including Category variables and linkages)
Tables
Conditional Value rules
Set-Variable rules
Syntax Lines
Lexical Form rules
Node Shape rules
Morphological rules
Lexical rules
Radio buttons and Checkboxes are generally dispersed through the file, associated with the grammar elements they are linked to.
The Syntax Lines specify the word and phrase order of the translated text by means of the order of the tiles in the line. Each tile may be:
* Some specialized output items like capitalization or space insertion/deletion, or a User category initialization (which instantiates new blank copies of all the variables in that category),
* Another Syntax Line, which defines the component order of that particular part of its parent line,
* A Set-Variable rule, which assigns values to variables used later in the Syntax Line or outside it,
* A Conditional Value rule, which returns a value -- typically a word to be emitted as part of the translated text,
* A Table rule, which is a more structured way to return a value that can be emitted, or
* A Variable containing some piece of text or a number, which is emitted as part of the translated text, or else
* A Tree variable, which causes the translation engine to "walk" that subtree and do whatever the lexical rules for the nodes in that subtree call for, typically invoking other Syntax Lines recursively.
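The tile kinds listed above can be pictured as a small tagged union that an emitter dispatches on, in tile order. The following Python sketch is only an illustration under stated assumptions: the class names, the `emit` function, and the dictionary used as a variable store are all invented for this example and are not BibleTrans internals. Tree variables (which would trigger a recursive tree walk) are left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Literal:            # specialized output item, e.g. an inserted space
    text: str


@dataclass
class VarTile:            # a Variable holding text or a number
    name: str


@dataclass
class RuleTile:           # stands in for a Conditional Value or Table rule
    compute: Callable[[Dict[str, object]], object]


@dataclass
class SetTile:            # stands in for a Set-Variable rule; emits nothing
    assign: Callable[[Dict[str, object]], None]


@dataclass
class LineTile:           # another Syntax Line nested in this one
    line: "SyntaxLine"


@dataclass
class SyntaxLine:
    tiles: List[object] = field(default_factory=list)


def emit(line, variables):
    """Walk the tiles left to right; output order follows tile order."""
    out = []
    for tile in line.tiles:
        if isinstance(tile, Literal):
            out.append(tile.text)
        elif isinstance(tile, VarTile):
            out.append(str(variables.get(tile.name, "")))
        elif isinstance(tile, RuleTile):
            out.append(str(tile.compute(variables)))
        elif isinstance(tile, SetTile):
            tile.assign(variables)          # side effect only
        elif isinstance(tile, LineTile):
            out.append(emit(tile.line, variables))
    return "".join(out)


# Hypothetical usage: a clause line that nests a noun-phrase line.
noun_phrase = SyntaxLine([VarTile("article"), Literal(" "), VarTile("noun")])
clause = SyntaxLine([
    SetTile(lambda v: v.update(article="the")),
    LineTile(noun_phrase),
    Literal(" "),
    RuleTile(lambda v: "runs" if v.get("number") == 1 else "run"),
])
result = emit(clause, {"noun": "dog", "number": 1})
```

Reordering the tiles in `clause` is the only thing that changes word order here, which is the point the text makes about Syntax Lines.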
The Syntax Line rules initially get invoked by being attached to a Node Shape rule, with each of the Syntax Line's designated variables linked to a named subtree of that node by a line that "connects the dots". The Node Shape rule does not determine translated text word order, only which Syntax Line gets used and what to do with the subtrees, which are typically connected to tree variables to be walked at a later time, when that variable occurs in a Syntax Line.
Syntax Line rules can also be invoked by being one of the tiles in another
Syntax Line.
If a Node Shape rule has a selected ("early") Set-Variable rule, the first thing it does is invoke that rule to set up any Variables that may be used in the remainder of the Node Shape rule, most notably a variable that may select which variant ("line") of the Node Shape rule to use. Once a variant is selected, it specifies a Syntax Line to use, and the variables listed as available to that Syntax Line get linked individually (by "connecting the dots") to subtrees of the node that invoked this Node Shape rule.
A Set-Variable rule determines the value to be assigned to one or more Variables in its scope, each variable in one line, in the order they are assigned. Variables can be set multiple times to different values in the same rule by the same variable appearing on several lines, each new value possibly depending on its previous value. The value given to a variable can be a constant (number or text), the value from another variable, or else whatever value is returned by a Table or Conditional Value rule.
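The sequential behavior described above, where a later line can see the value assigned by an earlier line, can be sketched in a few lines of Python. This is an illustration only: `run_set_variable`, the list-of-pairs rule format, and the use of a callable to stand in for a Table or Conditional Value rule are assumptions of this example.

```python
def run_set_variable(lines, variables):
    """Apply assignment lines in order; each line is (name, value)."""
    for name, value in lines:
        # A callable stands in for a Table rule, a Conditional Value rule,
        # or a reference to another variable's current value.
        variables[name] = value(variables) if callable(value) else value
    return variables


# Hypothetical rule: the same variable is set twice, the second time
# depending on its previous value.
rule = [
    ("count", 1),
    ("count", lambda v: v["count"] + 1),
    ("label", lambda v: "n=" + str(v["count"])),
]
scope = run_set_variable(rule, {})
```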
A Conditional Value rule consists of a series of condition lines, with a value to be returned for each condition. The last line is always a default value to be returned if none of the conditions evaluated true. The conditions and the returned values may be any combination of other values, derived from a Table lookup or another Conditional Value or a Variable, combined with arithmetic or string or boolean operators in any way that makes sense.
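A first-match-wins evaluation with a mandatory default, as described above, might look like the following Python sketch. The function name, the pair format for condition lines, and the example pronoun rule are assumptions made for illustration, not the engine's actual representation.

```python
def conditional_value(cases, default, variables):
    """Return the value of the first true condition, else the default."""
    for condition, value in cases:
        if condition(variables):
            # Values may themselves be computed, so they need not be constant.
            return value(variables) if callable(value) else value
    return default(variables) if callable(default) else default


# Hypothetical rule choosing a pronoun from a "person" variable.
pronoun_cases = [
    (lambda v: v["person"] == 1, "I"),
    (lambda v: v["person"] == 2, "you"),
]
second = conditional_value(pronoun_cases, "it", {"person": 2})
fallback = conditional_value(pronoun_cases, "it", {"person": 3})
```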
A Table rule specifies one or two numeric Variables as indices, then selects a row and possibly a column based on those values, and returns the Table value there. Table values must be constants, either a number or a text string. Tables are useful for efficiently building inflection affixes and pronoun stems. Conditional Values are more flexible but also more cumbersome, because they can depend on more than two variables (or even values not in variables) and can return values that are not constant (that is, the same conditions might return different values at different times). A table with no index variables can be used to present a constant value to a Syntax Line, or a large string constant to be assigned to a variable in a Set-Variable rule (which otherwise limits its constants to four or fewer characters).
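The three ways of using a Table described above (no index, one index, two indices) can be sketched as a single lookup function. In this Python illustration the function name, the nested-list layout, and the 1-based indexing are all assumptions; the pronoun table is made-up sample data.

```python
def table_lookup(table, variables, row_var=None, col_var=None):
    """Look up a constant by zero, one, or two numeric index variables."""
    if row_var is None:
        return table                        # no indices: a plain constant
    row = table[variables[row_var] - 1]     # assumed 1-based row index
    if col_var is None:
        return row
    return row[variables[col_var] - 1]      # assumed 1-based column index


# Hypothetical pronoun stems: rows are person 1-3, columns are
# number 1 (singular) and 2 (plural).
PRONOUNS = [
    ["I", "we"],
    ["you", "you"],
    ["it", "they"],
]
we = table_lookup(PRONOUNS, {"person": 1, "number": 2}, "person", "number")
constant = table_lookup("a long string constant", {})
```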
There are also category tables for giving constant values to specified Variables when particular nodes are encountered in a tree walk.
Variables may be defined within any of 31 categories, or else global and visible to any rule. Most rules have a selector for choosing the category from which to draw their variables. Global variables can be further distinguished as subtree (or "Tree") variables; using a Tree variable in a Syntax Line walks that subtree instead of generating output directly.
All the variables of a given category are created new and blank whenever
that category is opened, such as during a tree walk by entering a Proposition
0.4 or Thing 0.3 node, or one of the Discourse Relations, or explicitly
with one of the specified User categories
in a Syntax Line. These new variables disappear
again when the rule that instantiated them is finished and exits.
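One way to picture this lifetime is a stack of blank variable frames per category, pushed on entry and popped on exit. The Python sketch below is an assumption-laden illustration: the `Scopes` class, its method names, and in particular the idea that an outer instance of the same category becomes visible again after the inner one exits are guesses of this example, not documented engine behavior.

```python
from contextlib import contextmanager


class Scopes:
    def __init__(self, category_vars):
        self.category_vars = category_vars            # category -> names
        self.stack = {c: [] for c in category_vars}   # category -> frames

    @contextmanager
    def open(self, category):
        # Every variable of the category starts out new and blank.
        frame = {name: "" for name in self.category_vars[category]}
        self.stack[category].append(frame)
        try:
            yield frame
        finally:
            self.stack[category].pop()                # variables disappear

    def current(self, category):
        return self.stack[category][-1]


# Hypothetical usage with two nested proposition nodes.
scopes = Scopes({"Proposition": ["negate", "VformSele"]})
with scopes.open("Proposition") as outer:
    outer["negate"] = "not"
    with scopes.open("Proposition") as inner:
        inner_negate = inner["negate"]                # fresh blank copy
    outer_negate = scopes.current("Proposition")["negate"]
```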
All the category variables -- including pasty, which is examined in Conditional Value 22 Scrooge and assigned then incremented in Set-Variable 29 AdjPast -- are initialized blank, so Set-Variable 29 AdjPast has no effect except for concept 0.11 Narrative, where pasty gets initialized to 4 in the category table.
In any case, as the subtrees of any coordinating relations get walked,
more coordinating relations will be encountered, or else propositions,
Syntax Line 3.
Again, some subordinate clauses get fronted only sometimes, which
according to my notes is flagged in the variable Sub1st, but I seem
to have forgotten exactly how this happens.
Some semantic roles are fronted, for example in relative clauses, but it takes some effort to find them, because the semantic tree does not politely mark them for English (other languages being different, and this being a language-neutral tree). Counting the process of setting up subject-verb agreement, I needed 53 lines of Set-Variable rules, which I subdivided into Set-Variable 3 SetAgree followed sequentially by Set-Variable 9 R1Fro, 10 R2Fro, and finally by Set-Variable 11 R3Fro. Some of the semantic roles will be blank (unfilled from this node shape) and default to not fronted. Hopefully only one of them qualifies as the relative pronoun for fronting.
Once the initialization is complete, we can start to emit the fronted subordinate proposition PreProps (if any), then whatever fronted role was selected in FrontedRole (if any), then the subject and verb including subject-verb inversion, followed by whatever semantic roles didn't get fronted, adverbs, prepositional phrases, and non-fronted subordinate propositions (including the content of orienters).
English adverbs can also occur before the verb or in several other places in the clause, but it is not ungrammatical to put them all here after the semantic roles and before the prepositional phrases. Prepositional phrases can similarly be moved around somewhat for emphasis, but these subtleties are beyond the scope of this grammar.
We could do subject-verb inversion in either of two ways: by replicating the subject and/or negation everywhere it might occur but suppressing it in all the places where it does not belong, or else by replicating the verb phrase on both sides of the subject and suppressing in each instance the parts of the verb phrase that belong to the other side. I chose the former strategy for negation, and the latter for subject-verb inversion, as each seemed slightly simpler in its place. English verb phrases are by no means simple, but they aren't too difficult if you understand how they work. In order to repeat some or all of the verb phrase before the subject, I gave it its own Syntax Line 13 VerbPhr.
Between the two passes through the verb phrase I used Set-Variable 14
RevertSele to restore a couple variables to initial conditions that otherwise
get altered in processing the verb phrase, and to set category variable
vPh2 = 2 so Conditional Value rules that depend on which pass is active
can know that.
1. The first slot is uninflected and used to signify future or ability or subjunctive, or else contains the properly inflected helper verb "do" which is used for emphasis or for subject-verb inversion when no other slot is filled. When the first slot is filled, the next non-empty slot is always in the infinitive form.
2. The second slot can only be filled with the properly inflected helper verb "have" and controls the perfective tenses. If it is filled, the next non-empty slot after this is always in the past participle form.
3. The third slot is the helper verb "be" and controls the continuative aspect. If present, the next non-empty slot after this is always in the present participle (+ing) form.
4. The fourth slot is also the helper verb "be" and controls the passive voice. If present, the next (and last) non-empty slot (which is always the main verb) is again in the past participle form. The third and fourth slots are distinguished by the inflected form of the next word.
5. The fifth and final slot is the main verb, and is never omitted. However its form is determined by the most recent non-empty slot (if any) preceding it. Most first slot words are uninflected; otherwise the first non-empty slot is inflected for present or past; all subsequent slots are inflected as determined by the previous non-empty slot.
Negation and subject inversion (questions) both insert their respective components after the first verb slot, but never after the main verb. If no helper slots would otherwise be filled, then the properly inflected helper verb "do" is inserted in the first slot (with corresponding effect on the main verb). All of this complexity is supported in the VerbPhr Syntax line by replicating the verb part in category variable vWord five times, preceded each time by the setup rule Set-Variable 13 SlotPrep to ensure that the properly inflected helper or main verb (or else empty) is in vWord each time, and then followed each time (except the last, where it cannot) by Conditional Value 6 NegQ to insert the negation (if any) exactly once where it belongs.
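The five-slot scheme and the single insertion of the negation can be sketched as one loop in which each filled slot dictates the form of the next filled slot. This Python sketch is an illustration only: the `FORMS` data, the function and parameter names, and the restriction of the finite form to third-person singular present are assumptions of the example, and subject-verb inversion (the two-pass part) is omitted.

```python
# Illustrative forms: infinitive, past participle, present participle, and
# one finite form (3rd-person singular present only, for brevity).
FORMS = {
    "have":  {"inf": "have",  "pp": "had",     "ing": "having",  "fin": "has"},
    "be":    {"inf": "be",    "pp": "been",    "ing": "being",   "fin": "is"},
    "do":    {"inf": "do",    "pp": "done",    "ing": "doing",   "fin": "does"},
    "write": {"inf": "write", "pp": "written", "ing": "writing", "fin": "writes"},
}


def verb_phrase(main, modal="", perfect=False, progressive=False,
                passive=False, negate=False):
    # Each filled slot is (verb, form it forces on the next filled slot).
    slots = [
        (modal, "inf") if modal else None,    # slot 1: uninflected helper
        ("have", "pp") if perfect else None,  # slot 2: perfective
        ("be", "ing") if progressive else None,  # slot 3: continuative
        ("be", "pp") if passive else None,    # slot 4: passive
        (main, None),                         # slot 5: main verb, always there
    ]
    if negate and not any(slots[:4]):
        slots[0] = ("do", "inf")              # "do" fills slot 1 when needed
    words = []
    form = "fin"                              # first filled slot is finite
    negation_pending = negate
    for index, slot in enumerate(slots):
        if slot is None:
            continue
        verb, next_form = slot
        if index == 0 and verb != "do":
            words.append(verb)                # modal helpers stay uninflected
        else:
            words.append(FORMS[verb][form])
        form = next_form
        if negation_pending and index < 4:    # never after the main verb
            words.append("not")
            negation_pending = False          # inserted exactly once
    return " ".join(words)
```

For example, `verb_phrase("write", negate=True)` gives "does not write", and `verb_phrase("write", perfect=True, passive=True)` gives "has been written".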
There are a number of Conditional Value rules supporting the work of Set-Variable 13 SlotPrep. Conditional Value 7 DidNeg responds to the same conditions as Conditional Value 6 NegQ, but returns the opposite result, so that after the category variable negate has been inserted, it is cleared to blank and subsequent insertions have no effect.
The inflection of each slot is controlled by the previous non-blank slot. This is recorded as a number in category variable VformSele, which selects one of eight inflected forms from a list in each verb. VformSele is initialized with the proper inflection required for subject-verb agreement (and past, if so) by Set-Variable 3 SetAgree, which calls upon Conditional Value 3 SubjAgree to set it to one of the following values:
1 Infinitive (only after slot 1)
2 Past participle (only after slot 2 or 4)
3 Present participle (only after slot 3)
4 1st-person singular present
5 3rd-person singular present
6 Plural or 2nd person present
7 Singular past, except 2nd person
8 Plural or 2nd person past
Conditional Value 5 NonBlankSlot examines the slot number in category variable SlotPosn, and if the slot was non-blank (category variable VerbParts) in the appropriate slot number, sets VformSele to 1, 2, or 3 as needed. This is done after the verb has been emitted for this slot, and before the slot number in SlotPosn has been incremented for the next slot.
Conditional Value 4 MakeSlotVerb looks at the slot number SlotPosn, then chooses the appropriate helper verb for that slot, or blank if none is called for. There are eight comma-delimited items in the full specification of a verb inflection (see the list above); Conditional Value 13 InflectVerb will choose one of those based on VformSele, but if that item is blank it defaults to the first (infinitive) item as a convenience. When subject-verb inversion is in effect (category variable InverQ is non-blank), InflectVerb also returns blank for every slot after the first non-blank during the first pass, and only for the first non-blank slot during the second pass; otherwise it is entirely blank during the first pass. This gets the first non-blank slot in front of the subject when needed, and not otherwise.
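The selection-with-fallback part of InflectVerb can be sketched as follows. This Python fragment is an illustration under assumptions: it supposes the eight comma-delimited items appear in the same order as the numbered VformSele values, the function name and sample specifications are invented, and the blanking behavior for the two inversion passes is left out.

```python
def inflect_verb(spec, vform_sele):
    """Pick item number vform_sele (1-8) from a comma-delimited verb spec."""
    forms = [item.strip() for item in spec.split(",")]
    forms += [""] * (8 - len(forms))      # tolerate specs with fewer items
    chosen = forms[vform_sele - 1]
    return chosen or forms[0]             # blank item: fall back to infinitive


# Hypothetical specifications; blank items rely on the infinitive default.
BE = "be,been,being,am,is,are,was,were"
WALK = "walk,walked,walking,,walks,,walked,walked"

third_singular = inflect_verb(BE, 5)
first_singular = inflect_verb(WALK, 4)    # item 4 is blank in WALK
```

The fallback is what lets a regular verb leave most present-tense items empty.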
Predicate adjectives are treated like a passive inflected verb, except that the adjective is in the main verb slot. They are distinguished by a separate Lexical Form (preserved in the predefined variable LexRuleForm).
Conditional Value 31 Get1verb is used only in comparisons, which are
handled separately through Syntax Line 31 CmprLine.
Another form of orienter is imposed on us by the need to present impersonal
remarks about some action, such as its necessity or advisability. This
is handled in the second variant of OrienterProp, and the subject-verb
relation of the internal proposition is converted to a possessive of an
abstract noun, such as "My trip was fun" for "I enjoyed that I travelled."
However, I seem to have forgotten exactly how this works.
PN-Conditional Value 31 GetDiff does most of the work in deciding how to realize the comparison. Its default value (the last line) invokes a special Tree comparison operator, which compares the two propositions: if they are structurally identical except for a single semantic role, then that role is pulled out into category variable DiffNoun as the standard against which to run the compare, and we choose the second variant of the "*compare" node shape rule, which is also the second line of Syntax Line 31 CmprLine; otherwise DiffNoun is left blank and we get two full propositions using the first variant and line respectively.
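The idea behind the Tree comparison operator, finding the one semantic role in which two otherwise identical propositions differ, can be sketched on a much simplified data model. In this Python illustration a proposition is a dictionary with a predicate and a flat mapping of roles to fillers; that model, the function name, and the choice to return the filler from the second proposition are all assumptions of the example, not the engine's actual behavior.

```python
def single_difference(first, second):
    """Return the differing filler if exactly one role differs, else blank."""
    if first["predicate"] != second["predicate"]:
        return ""
    roles_a, roles_b = first["roles"], second["roles"]
    if roles_a.keys() != roles_b.keys():
        return ""                              # not structurally identical
    differing = [role for role in roles_a if roles_a[role] != roles_b[role]]
    if len(differing) != 1:
        return ""                              # realize two full propositions
    return roles_b[differing[0]]               # the standard of comparison


# Hypothetical propositions behind "John is taller than Mary".
john = {"predicate": "tall", "roles": {"subject": "John"}}
mary = {"predicate": "tall", "roles": {"subject": "Mary"}}
diff_noun = single_difference(john, mary)
```

A blank result corresponds to the case where DiffNoun is left blank and both propositions are emitted in full.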
Set-Variable 31 CmprSetup calls on PN-Conditional Value 31 GetDiff five
more times to separately extract the adjective or adverb from the previous
treewalk, and then to determine (or fabricate) its comparative form, and
to strip off the semantic role node from over the subject (if that is the
DiffNoun)
but not off the other roles. These could be separate Conditional Values,
but I was running out (the next version will have a lot more to use). Set-Variable
27 BlankMore is invoked between the two propositions to reset a couple
variables for the second proposition.
AdposLn line 1 is the default case, which handles most prepositional phrases and also all semantic roles except subject. The direct object comes through here with its "preposition" in lexical parameter AdpLex already blank. In some cases another semantic role gets realized as direct object in English, so the PN-Conditional Value 8 OptAdLex suppresses the preposition for those cases. The object of the preposition is always in the Objective case (category variable NounCase = "O").
Line 2 is selected for stative propositions 0.102 ClassMember and 0.103 Attributive, and for 33.126 Name. It seems to be just a way of stringing some words together, with no case management or any such thing.
Line 3 converts the prepositional phrase "of [whatever]" into a possessive which is then fronted before the head noun of the noun phrase by magical (unspecified) means.
Line 4 similarly converts the prepositional phrase into a possessive, but prefers the nounless possessive ("mine", "yours", "hers" etc). It seems to do this when a predicate adjective is used statively, that is, directly under a 0.4 Proposition node, but I don't understand why.
Line 5 is used for apposition 0.225, and also for coordinating more than two Things in a list with 0.222-0.224; if there are only two, then 89.92 "and" conjunction is the preferred tree encoding. These take the inherited noun case, and emit a comma or "and" separator.
The object of a preposition is always a noun phrase (0.3 Thing) or possibly
a Thing over a conjunction of Things. We look at Syntax
Line 23 NounPhr (noun phrase) next.
Category variable Possessor comes next (or rather, in its place, since they won't both be non-empty) if there is a fronted possessor (a pronoun, or else a proper name which can have "'s" added), as determined by Early Set-Variable 11 PosOrPrep calling PN-Conditional Value 11 PrePossess repeatedly.
We have two classes of adjectives: quantifiers and other kinds that come first, and then (the default) other kinds of adjectives, which come after the quantifiers. The first class is marked lexically by a "1" in the Lexical rule parameter Jclass, which is otherwise blank. We run through the list of adjectives in category variable UsedAdj (copied from tree list variable Adjects) twice, presetting category variable Jseq to 1 or something else for matching against Jclass, and then let Syntax Line 24 AdjecLin pick and choose them in each pass as they come up.
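The two-pass selection can be sketched as a loop over the same adjective list, run once per class. This Python fragment is only an illustration: the function name, the dictionary standing in for the lexical Jclass parameter, and the use of a blank string for the second pass are assumptions of the example.

```python
def order_adjectives(used_adj, jclass):
    """Emit class "1" adjectives first, then the default (blank) class."""
    ordered = []
    for jseq in ("1", ""):                 # one pass per adjective class
        for adjective in used_adj:
            if jclass.get(adjective, "") == jseq:
                ordered.append(adjective)  # picked only in its own pass
    return ordered


# Hypothetical lexicon: only the quantifier carries Jclass "1".
jclass = {"three": "1"}
phrase = order_adjectives(["red", "three", "big"], jclass)
```

Within each pass the adjectives keep the order they had in the list.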
Certain kinds of kinship terms (in category variable LateKin)
and Titles (both handled by Syntax Line 22
WhoseKin) also precede the head noun, which is inflected in Conditional
Value 23 InflectNoun, if it is plural and/or possessive case. It is then
followed by explicitly updating the pronoun reference (which is thus effective
in the following phrases), then any prepositional
phrases and relative clauses.
The pronoun generator completely overrides normal Thing handling, so
because we still want descriptive relative clauses and some adjectives,
these are added onto the PronoGen Syntax line.
PN-Conditional Value 22 KinOpt finally returns either the lexical form
of the contained noun, or else blank if unneeded.
These are all the Syntax Lines. Hopefully,
after I clean up the gaps, it should be a fairly complete description of
the English grammar.
Working Draft, 2013 February 6