BibleTrans Design Decisions

BibleTrans Design Decision Log

This document records the thinking and events that went into the significant design decisions in the development of BibleTrans to run on the PC. The dates on some of the earlier items are only approximate, as the decision to maintain this detailed log was made in 2005, about a year after the project was resurrected from the ashes of BibleTrans International. There are additional documents describing Data Formats and the Virtual Machine, and another Problems document describing decisions still in process or not yet made.

Topics

Why Resources?	Doc Files	Word Encoding	Text Types
Picture Encoding	Interlinear Text	Resource Types	Parse Codes
Doc Res IDs	Tree Node Encoding	Text Encoding	Tree Verify
External Doc File	Public Greek text	Translation Engine	Language File
Output Text	Active Image	NET Bible	Next Rewrite
OOPS	Resource File Format	ImplicitInfo	Nested Relations
Building the Database	Hebrew Encoding

2014 June 20 -- Hebrew Encoding. The upload earlier this month seems pretty stable, and my Luke 13:8 year is up, so now I'm trying to learn Android as a way to build a new revenue stream. I have an idea that needs Hebrew, so this might also be useful for getting BT to do Hebrew. Before Unicode, all the different implementors used different font encodings for the glyphs. The Unicode glyphs are alphabetical but not particularly memorable, so I came up with this encoding that matches the consonants of most of the fonts. The vowels, if not alphabetic, at least sort of resemble the Hebrew points. Hopefully I can read Hebrew text so encoded (as I can Greek) by sounding it out in a Latin alphabet, which makes sense when a Hebrew font is not available.

My first cut had the dagesh inserted by the software and caps were final forms, but it fit badly. There are enough codes to let caps be the dagesh, so I moved the final forms to other characters (which is less readable in a Roman font, but I can do that substitution very late in the file processing). This is the latest encoding:

2014 May 6 -- Nested Discourse Relations. When Elizabeth Miles was building the trees for Philippians and Luke, I had no working translation engine to test the trees, so some tree structures got built without adequate consideration of what they might mean. One of these is the placement of subordinate discourse relations. She built the trees with a mind to accurate exegesis of the Greek text using existing Louw&Nida definitions -- which is good -- but how does one translate a subordinate proposition that is not subordinate to some other proposition? Now we have tree files with a lot of these situations. Somehow they need to be reshaped so that subordinating relations occur only under propositions. This will require some re-thinking of her exegesis, and possibly defining some additional L&N concept "clones" (same definition, different semantic node shape).

We also have a few cases where the 0.311 ImplicitInfo marker hangs under a coordinating relation, apparently to mark the whole relation (possibly including a containing subordinating relation) as implicit. I now allow this situation when the coordinating relation is under a higher subordinating relation, to which the implicitness can be assumed to apply, and mark the relation the way I already mark Things and Propositions (see ImplicitInfo).

There are similarly a few cases where 0.315 FirstInSurface hangs under a coordinating relation, with the same apparent purpose. The way grammars currently work, optional fronting only works for subordinate propositions. Either this needs to be under a subordinating relation and work that way, or else it must be disallowed entirely. The current implementation allows it, but no grammar recognizes the situation.

See the Problems document for my latest thinking on subordinating relations with respect to coordinating relations.

2014 April 21 -- ImplicitInfo. 0.311 is a marker on Things and Propositions which designates them and their containing relation or adposition or whatever as implicit. The Tree verify code in DoTrees already sets a flag on the 0.3 or 0.4 node to indicate the contained marker, so the execution engine need only look for this flag and does not need to scan the whole subtree. It works this way:

This applies to any subordinating relation or adposition, or anything else that can (generally must) have a single Thing or Proposition as a subtree. When that lexical rule is compiled, code is generated to copy the current node (the relation or adposition) to a magic variable "Imp Info" and then call the BT-defined "Implicit Test" rule to do the following: if the node does not contain a 0.3 or 0.4 as a subtree, or if that subtree is not marked as implicit, or if this is the prescan pass, -1 is stored to "Imp Info" (replacing the node reference) and it returns. Otherwise "Imp Info" is set to -2 and global BT variable "Inplicit" is incremented; then, if it was previously zero and the "Include ImplicitInfo" checkbox is checked on, whatever left bracket is in the Implicit beg-end table is generated. If the "Include ImplicitInfo" checkbox is not checked, the lexical rule can test "Inplicit" after coming back, and skip over everything inside if nonzero.

At the end of the same lexical rule the "Implicit Test" rule is called again, and it sees the negative value in "Imp Info" and it returns, possibly after decrementing "Inplicit" and if newly zero, generating whatever right bracket is in the Implicit beg-end table. The zero test allows for nested markers but only brackets the outer one. I have not yet figured out how to promote a 0.316 Focus subtree inside an otherwise implicit tree when generating the translated text.
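The two calls to "Implicit Test" can be sketched roughly as follows, in Python rather than in BibleTrans rule code. This is only an illustration under stated assumptions: the class, the function names, the dictionary node shape, and the square brackets are all invented here; a node's "implicit" flag stands in for "contains a 0.3 or 0.4 subtree marked implicit"; "Imp Info" is kept separately for each rule invocation; and the right bracket is assumed to be generated only when the checkbox is on.

```python
class ImplicitState:
    """Hypothetical stand-in for the BT globals used by "Implicit Test"."""

    def __init__(self, include_implicit, prescan=False, brackets=("[", "]")):
        self.include_implicit = include_implicit  # the "Include ImplicitInfo" checkbox
        self.prescan = prescan                    # True during the prescan pass
        self.left, self.right = brackets          # assumed Implicit beg-end table entries
        self.inplicit = 0                         # the global "Inplicit" counter
        self.out = []                             # generated output tokens

    def begin(self, node):
        """First call: returns the value left in "Imp Info" (-1 or -2)."""
        if self.prescan or not node.get("implicit", False):
            return -1
        self.inplicit += 1
        if self.inplicit == 1 and self.include_implicit:
            self.out.append(self.left)            # bracket only the outermost marker
        return -2

    def end(self, imp_info):
        """Second call: sees the negative value and undoes what begin() did."""
        if imp_info == -2:
            self.inplicit -= 1
            if self.inplicit == 0 and self.include_implicit:
                self.out.append(self.right)


def translate(node, state):
    """A lexical rule for a relation or adposition, reduced to its skeleton."""
    imp_info = state.begin(node)
    # With the checkbox off, everything inside an implicit node is skipped.
    skip = not state.include_implicit and state.inplicit > 0
    if not skip:
        state.out.append(node["text"])
        for child in node.get("children", []):
            translate(child, state)
    state.end(imp_info)


tree = {"text": "A", "children": [
    {"text": "B", "implicit": True,
     "children": [{"text": "C", "implicit": True}]},
    {"text": "D"},
]}

shown = ImplicitState(include_implicit=True)
translate(tree, shown)
print(" ".join(shown.out))    # A [ B C ] D

hidden = ImplicitState(include_implicit=False)
translate(tree, hidden)
print(" ".join(hidden.out))   # A D
```

The nested marker on C raises the counter to 2 without generating a second bracket, which is the zero test at work.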

I guess the user needs to be able to define additional operations to be performed inside the "Implicit Test" rule, but I have not yet figured out how that should work.

Sometimes it's useful to make a single adjective or adverb implicit; these are encoded as a 0.311 node containing that adjective or adverb as a subtree. This is tested in the built-in 0.311 lexical rule, which then either emits a left bracket, walks the subtree, then emits the right bracket, or else skips over it entirely. These subtreed Implicits are not considered when marking a containing Thing or Proposition.

2013 May 8 -- Resource File Format. I couldn't think of where to put this, but it needs to be somewhere, so here it is. The file is structured into 4K-byte (1024 integers) blocks, of three types:

Block 0, Main Index. The first six numbers are the header:

+0 -- 0x1DEAF00D, the file signature
+4 -- the number of blocks
+8 -- (unused) resource name list (always 0 for now)
+12 -- end of types list, beginning of block allocation bits
+16 -- first block of the allocation bits (assumes previous blocks all taken)
+20 -- number of types (max 83 types)
+24 -- first type in the type list, three integers each:
+0 -- number of resources of this type
+4 -- (4-character) type name, packed with first char in high bits of integer (as if Big-Endian)
+8 -- beginning of the resource list for this type, block+offset

All file locations are 31 bits, a 19-bit block number over a 12-bit (integer, 0-1023) offset in the block, for a max file size of 2GB. No data is split across blocks, which (for now) limits the largest resource to 4096 bytes. The numbers are stored in the native integer format of the host system; the file signature distinguishes between Big- and Little-Endian formats (it will be 0DF0EA1D in the wrong system).
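As a rough illustration of the header layout and the signature test, here is a small Python sketch. It assumes that the "+n" header offsets are byte offsets, that every number is a 32-bit unsigned integer, and that the type list entries follow one another directly from +24; the function names are invented for this sketch, and the low 12 bits of a location are passed through without deciding what unit they count.

```python
import struct

SIGNATURE = 0x1DEAF00D
BLOCK_BYTES = 4096


def byte_order(block0):
    """Return the struct prefix ('<' or '>') under which the signature reads correctly."""
    for prefix in ("<", ">"):
        if struct.unpack_from(prefix + "I", block0, 0)[0] == SIGNATURE:
            return prefix
    raise ValueError("not a resource file: bad signature")


def pack_type_name(name):
    """Pack a 4-character type name with the first character in the high bits."""
    raw = name.encode("ascii")
    if len(raw) != 4:
        raise ValueError("type names are exactly 4 characters")
    return int.from_bytes(raw, "big")


def unpack_type_name(value):
    return value.to_bytes(4, "big").decode("ascii")


def split_location(location):
    """Split a 31-bit file location into (19-bit block number, 12-bit offset field)."""
    return location >> 12, location & 0xFFF


def join_location(block, offset):
    return (block << 12) | offset


def parse_main_index(block0):
    """Read the six header numbers and the type list from Block 0."""
    prefix = byte_order(block0)
    _, blocks, _names, _types_end, _first_alloc, ntypes = struct.unpack_from(
        prefix + "6I", block0, 0)
    types = []
    for i in range(ntypes):
        count, packed, where = struct.unpack_from(prefix + "3I", block0, 24 + 12 * i)
        types.append((unpack_type_name(packed), count, split_location(where)))
    return {"byte_order": prefix, "blocks": blocks, "types": types}


# Build a made-up Main Index with one type, in both byte orders, and read it back.
# All the header values here are invented for the demonstration.
for prefix in ("<", ">"):
    header = struct.pack(prefix + "6I", SIGNATURE, 10, 0, 36, 0, 1)
    entry = struct.pack(prefix + "3I", 5, pack_type_name("DocX"), join_location(3, 16))
    block0 = (header + entry).ljust(BLOCK_BYTES, b"\0")
    print(parse_main_index(block0))
```

Trying both byte orders on the signature is what lets one reader open files written on either kind of host.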

+256 -- the start of the allocation bits, one bit for each block allocated (in use).

When there are more than about 6000 blocks allocated (24MB), the beginning of the allocation bits, which will be all ones at that time, is truncated, and the number of truncated blocks is stored in +16. Subsequent deletion of a resource in that truncated region will not reclaim its file space.

Resource Lists. The first three numbers are a sort of header:

+0 -- (block number) link to final list block, or 0 if it's all in the first block
+4 -- next available short-resource place for this type, block+offset (valid only in final index block)
+8 -- (4-character) type name, to make it easier to review in a file dump (reversed if Little-Endian)
+12 -- first resource in this type, four integers each:
+0 -- resource ID number (16 bits, originally for compatibility with the MacOS)
+4 -- name link, if any (always 0 for now)
+8 -- resource location, block+offset
+12 -- size (in integers) of this resource

There is no limit (other than file size) to the number of resources of one type. When there are too many to fit on the initial resource list block, the next number is a high bit over a block number where the list is continued. Initially, small resources are allocated at the back end of the initial list page, filling downwards toward the list. After the page fills up, a new block is allocated for small resources, filling upwards. The block number of resource locations in the list page is always zero.

When I get around to implementing resources larger than 4K bytes, I expect to pack 1023 integers in each block, with a link in the last integer of each block. The resource list will always give the total size in integers, so when it's greater than 1024 the system knows to look for a link at the end of each block.

Data. The resource data is only whatever the program gave it to write. Resources more than a few hundred words are always given their own block. When a resource is replaced by an equal or smaller resource, or if it already has a full block to itself, it always goes in the same place; otherwise it is given a new full block. When small resources are deleted or replaced by larger ones of the same ID, the original file space is not reclaimed, but resources in their own blocks always have that file space reclaimed when they are deleted or replaced, unless the block lies in the truncated region of the bit map.

BibleTrans allows for more than 4K bytes in a resource, by splitting them into 4K blocks, then storing an index of resource numbers at the original resource location. Real Soon Now (that's code for "don't hold your breath waiting") I hope to pull that capability into the Resource Manager, so there will be no limit to block size. They will still be blocked, but linked instead of in a list of fractional blocks.

2010 November 19 -- OOPS. Before starting that rewrite, I decided to upgrade my Turk/2 language to incorporate Object-Oriented Programming Stuff. Most of OOPS hype is a crock (good things that have been around for decades, just not in C, or else Bad Ideas that are not bad enough to overwhelm the good ideas), but the one new language feature not otherwise available (and not widely understood even by passionate C++ programmers) is the use of subclass method overrides to accomplish strongly-typed callbacks. I probably don't really need it in BibleTrans, but OOPS is the fad du jour, so supporting it will make the program ever so slightly less distasteful to prospective programmers trained in the education factories.

2009 May 27 -- Next Rewrite. It looks like I'm running into address space limits. If and when I do a major rewrite on this program, I want to make the following design changes:

Tree-Transform -- Add a new kind of grammar that does temporary tree transforms
Resources -- Keep the 4K block size, but automatically cut and pack larger resources to fit.
Rule Numbers -- Allow for 256 of each kind of rule, no distinct PN-conditionals nor early SetVars.
Morphological -- More of these, too, maybe 3-digit line numbers.
Checkbox Enables -- Eliminate these; if a rule is specified, it's enabled (except pronouns).

2009 February 14 -- NETBible. I ran an episode count on the whole NETBible, with the NT broken out separately, where an "episode" is whatever comes between section titles. About 800 NT episodes is comparable to current GNT divisions, but there are 16672 notes for the NT alone (if you count them as separate DocX resources). I decided to collect the notes into pages, as many as would fit in one DocX resource, but restarting for each NETBible episode, which results in fewer than 2000 pages. My new allocation (rev. 2009 March 9):

00001+n -- Sequentially assigned to documentation, one per page/screen.
13000+n -- Sequentially assigned to successive episodes in the NETBible NT, 13798 is last
14000+n -- Sequentially assigned to NETBible NT note pages, 15876 is last
17200+n -- Sequentially assigned to L&N tagging Choice pages, 18897 is last
19000+n -- Sequentially assigned to verse index, 19260 is last
20000+n -- Sequentially assigned to successive ABP and L&N definitions, one per concept; 27765 is last
28000+n -- Sequentially assigned to successive popup notes in ABP and L&N definitions, one each.
29800+n -- Other lists of concepts
29900+n -- Sequentially assigned to ABP and L&N ToCs, for each domain n. 29999 is the master ToC
30000+n -- Sequentially assigned to successive episodes in the Greek NT, one each episode
32320+n -- Sequentially assigned to overflow pages (bigger than 4K bytes).
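The allocation above amounts to a table of base numbers, so finding which group a DocX number belongs to is a search for the nearest base at or below it. A minimal Python sketch follows; the function name and the short labels are mine, and it does not check the "is last" limits.

```python
from bisect import bisect_right

# (base number, what is assigned from that base upward), copied from the list above
BASES = [
    (1, "documentation pages"),
    (13000, "NETBible NT episodes"),
    (14000, "NETBible NT note pages"),
    (17200, "L&N tagging Choice pages"),
    (19000, "verse index"),
    (20000, "ABP and L&N definitions"),
    (28000, "popup notes in ABP and L&N definitions"),
    (29800, "other lists of concepts"),
    (29900, "ABP and L&N ToCs"),
    (30000, "Greek NT episodes"),
    (32320, "overflow pages"),
]
STARTS = [start for start, _ in BASES]


def docx_group(resource_id):
    """Return the label of the allocation group containing resource_id."""
    if resource_id < 1:
        raise ValueError("DocX numbers start at 1")
    return BASES[bisect_right(STARTS, resource_id) - 1][1]


print(docx_group(13798))   # NETBible NT episodes
print(docx_group(29999))   # ABP and L&N ToCs
```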

2007 August 2 -- Undo. There is a whole separate document describing the Undo Manager.

2007 June 20 -- Compiled Rules. There is a whole separate document describing the compiled rule code, which may be useful also in the on-line documentation.

2006 July 19 -- Active Image Elements. An important part of guiding the user/linguist through the complex and obscure process of encoding linguistic data for BibleTrans to use in making a translation is a combination of document text with active images, which the user can click and/or drag to edit the underlying data. In the file each active image looks just like an ordinary picture, but is slightly distinguished in its reference, so that the document formatter knows it is active. A designated process registers its interest in a particular window's active images, and thereafter receives the user activity for those image elements, at which time it can update or replace the picture specifications in the file and then cause the window to be redrawn. When it comes time to run a translation, these data structures are compiled into virtual machine code which is interpreted by the translation runtime engine.

For BibleTrans (perhaps generally), we want to designate image types to be either one of the following:

PopM 1 Popup menu
Push 2 Pushbutton
CkBx 3 Checkbox
RadB 4 Radio button selection group
EdLs 5 Editable list of text items
Tb2x 8 A 2-dimensional table

or else any of a number of special images, like a draggable sentence element line. The type of element is encoded in the image using the indicated code numbers, and sent as a message to the handler process when the user clicks on it.

DrLn 6 Drag line group
CkDL 7 Drag lines tied to checkbox (omit if not checked)
ItLs 13 List of items to be used in drag line
LNtb 10 L&N selection table
CkTb 9 L&N table tied to checkbox
LkTb 12 Lookup table, with user-selected axes
FxLs 11 Editable list of checkbox names
VrLs 14 Editable list of variable names
DotG 15 Dot connector group
Iffy 16 Conditional expression value
SetV 17 Sequence of variables and values to set them to
VrCn 18 Variable connector
PnEx 19 Pronoun selector
Gfnt 20 Character glyphs
ChSt 21 Character set
Mrph 22 Morphological rule

The numbers shown here are encoded in the hot links embedded in document pages; the numbers are passed to the Active Image formatter to choose how to format the data specifications. The specifications for these image elements are stored in the document file as Adat resources, which are copied to the specific language files to be edited by the user. Also in that language file are AcIm resources, the actual picture data. Because these are active data, modifications to one item can alter the way others are displayed; for example, turning a checkbox on might enable a table or drag line. These dependencies are stored in Adep resources in the document and language files.
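For reference, the two lists of code numbers can be gathered into a single lookup, here as a Python enumeration. The class name is invented for this sketch; the four-character names and numbers are copied from the lists above.

```python
from enum import IntEnum


class ActiveImage(IntEnum):
    PopM = 1    # popup menu
    Push = 2    # pushbutton
    CkBx = 3    # checkbox
    RadB = 4    # radio button selection group
    EdLs = 5    # editable list of text items
    DrLn = 6    # drag line group
    CkDL = 7    # drag lines tied to checkbox
    Tb2x = 8    # 2-dimensional table
    CkTb = 9    # L&N table tied to checkbox
    LNtb = 10   # L&N selection table
    FxLs = 11   # editable list of checkbox names
    LkTb = 12   # lookup table, with user-selected axes
    ItLs = 13   # list of items to be used in drag line
    VrLs = 14   # editable list of variable names
    DotG = 15   # dot connector group
    Iffy = 16   # conditional expression value
    SetV = 17   # sequence of variables and values to set them to
    VrCn = 18   # variable connector
    PnEx = 19   # pronoun selector
    Gfnt = 20   # character glyphs
    ChSt = 21   # character set
    Mrph = 22   # morphological rule


# A hot link carrying the number 13 would select the drag-line item list.
print(ActiveImage(13).name)   # ItLs
```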

2006 June 16 -- Output Text. Not very many of the intended languages to be targeted by BibleTrans use a strictly Roman alphabet with no diacriticals. Many -- perhaps most -- of them do not even have a fully defined Unicode alphabet to work from. At the very least I need to support two levels of non-Roman font generation: existing Unicode codes and glyphs, and new glyphs on an ad-hoc basis. New glyphs are best met with a simple font editor, in which the user draws in a pixel box assigned to a character number. Once that is working, it's a simple matter to import an existing (installed) Unicode font and present its glyphs for editing in the same pixel box.

In some languages -- notably Arabic and Asian languages like Urdu -- the appearance of a single character depends on its context (neighboring characters). BibleTrans already must (and the existing prototype does) support morphological rules that do exactly that: locally alter the output text on the basis of character adjacency rules. I think the strange script requirements of cursive language fonts can be fully supported by this mechanism. It is not necessary for this to be esthetically beautiful, just show the translated output text in a readable form.

The OTrx resource data will consist of two types of tokens: links to the output text (in a separate EmTx resource), and rule names in the decision tree that led to tokens higher up on the display (see example). For display purposes, every token needs both a rule location (rule name and offset or line number) and a link to the node below it. The top-level (output text) tokens also need a link to the next token to the right, and a reference to EmTx. When morphological rules are applied, there will be another line of text, the readable output (after the morphological rules were applied) in a line above the generated text (before the rules were applied). For details, see Structured Text Output Resources in the Data Format document.

In the example picture, above the top line is yet another line containing "gloss" words. A linguist working in some other language than his mother tongue would typically attach (English) gloss words to each target language vocabulary entry; in this case I used the Louw&Nida concept numbers as the "gloss" since the translation is already in English. The software doesn't really care, the glosses could be French or Swahili or Spanish in countries where that is more natural.

2006 June 1 -- Language File Resources. The program can handle any number of languages, each with its own file of translation rules and Code. A menu displays the available languages, with the current active language checked. After selecting a Tree node to be translated, the Translate step uses the Code resources in the current active language file to build structured representations of the translation decision process (for debugging), also stored in the language file, from which the output text can be extracted into separate text file(s).

The language file resources:

Adat -- Each resource contains the data for specifying the GrammarWhiz rules, as described briefly above, and in more detail in a separate document on Adat resources.

Bitz -- This is a collection of "bit-set" resources used to capture and remember which lexical Rules need review and/or recompilation. Bitz #1 contains a single integer, the bits of which identify the forms that changed since the last review. Bitz #2 through #5 have a bit for each possible lexical rule, =1 for whatever is recorded there: #2 is the non-trivial rules. #3 is rules that need user review. #4 is constructed on the fly, each concept appearing in the currently selected tree when a translation or grammar rule verify is started. #5 is the rules already compiled (currently ignored: everything is recompiled if there are any changes). #7 is 32 integers, one or two lexical rule references from the set of all lex rules that use each lexical rule form (or zero if none).

Code -- Each resource contains one or more compiled Rules as a sequence of byte codes. They are indexed in the CodX resources.

CodX -- The first eight of these index translation Code for 1000 L&N concepts each (see L&NX for encoding); named Rules are assigned reference numbers > 8000, indexed in subsequent CodX resources. Each integer in this index contains a 16-bit Code resource number and a 16-bit byte offset in that resource.

EmTx -- This contains the output translated text, and a word index into the text. The resource numbered by the episode is the index. The episode is a 10-bit number, to which is added an offset counting down from 32640 by -1024 for the resource number of the actual text. Each word of the index has a 12-bit byte offset, a 5-bit resource number (both referring to the other resource), then a length and a pixel size. The same index and text is also used for the structured output (OTrx).

Gfnt -- If the generated text requires diacriticals or non-Roman glyphs, this resource contains up to 255 (nominal 12x12) bitmap glyphs for display after the application of morphological rules. The resource data is in two parts, an index with one entry for each character, the offset into the pixel section where that character starts, then the pixels themselves, formatted the same as the native BibleTrans fonts. The first word of the resource is the size of the index, which could be less than 256. [This data was moved to Adat resources to simplify export.]

Morf -- This resource links the generated text of EmTx to the MoTx after morphological rules have been applied. Each entry gives a character position of the generated EmTx, the corresponding position in the MoTx (which may be different, as substitution rules can add or delete characters), and the rule number that was applied.

MoTx -- This contains the output text, after application of morphological rules (if any).

OTrx -- These resources contain a tree-structured representation of the translated Tree for a particular text. The resource contains both the translated text and a decision tree showing how it was derived, with the tree nodes labeled by the Rule that made that decision. The actual text is stored in EmTx resources and referenced by offset. For details, see Structured Text Output Resources in the Data Format document.

Rule -- These resources contain the structured rules for translating this language. They are indexed in RulX resources. When a rule is edited, it is automatically recompiled into its associated Code. GrammarWhiz makes these rules obsolete.

RulX -- These index the Rule resources using the same format as the CodX resources.

Vars -- The first Vars resources index up to 1000 variable names each; the name strings are stored in subsequent resources.
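The CodX entry described above (and, by the same format, a RulX entry) packs two 16-bit fields into one integer. The text does not say which field occupies the high half, so the Python sketch below simply assumes the resource number is in the high 16 bits and the byte offset in the low 16; the function names are invented.

```python
def pack_index_entry(resource_number, byte_offset):
    """Combine a 16-bit resource number and a 16-bit byte offset into one integer.

    Assumption: resource number in the high half, byte offset in the low half.
    """
    if not (0 <= resource_number <= 0xFFFF and 0 <= byte_offset <= 0xFFFF):
        raise ValueError("both fields must fit in 16 bits")
    return (resource_number << 16) | byte_offset


def unpack_index_entry(entry):
    """Split an index entry back into (resource number, byte offset)."""
    return entry >> 16, entry & 0xFFFF


entry = pack_index_entry(42, 300)
print(hex(entry), unpack_index_entry(entry))
```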

2006 May 31 -- Translation Engine. The following table lists the virtual machine opcodes in the Mac prototype (BT 0.6), which I will adopt with possible modifications as needed. To start with, I have defined assembler mnemonics, so I can build and test the translation engine before I have GrammarGuru working:

00 Nop error/no operation (can't happen)

01,nn Lino line number nn, for debugger

02,nn OpFr open frame for calling procedure nn

03 CallFr call procedure whose frame 02 opened

04 Stop stop

05 CallLN call L&N proc on ToS

06 AnoLst iterate all nodes in list

07 EnoLst iterate all but last node in list

08 OK procedure exit, ok

09 Done translation completed successfully

0A,xx Jump branch +/-xx bytes

0B,xx BrF branch on false +/-xx bytes

0C,nn NuVar create new var nn of ToS

0D,nn Sto store ToS into var nn

0E,nn Ld push var nn onto stack

0F,nn nn push integer nn

10 False / "" push false (empty string)

11,str "str" push literal string 'str'

12,str Do execute string 'str' as OS script (not yet implemented)

13,x Tree deref ToS+x, which is tree offset

14 Pack PackPt ToS,ToS-1, build 32-bit integer from two 16-bit parts

15 Swap swap ToS

16 Pop pop ToS

17 Dupe dupe ToS

18 LdAtr deref ToS, which is tree; replace it with tree attribute string

19 StAtr deref ToS, which is tree; store ToS-1 as tree attribute string

1A Rot3 rotate ToS below next 2 (not yet implemented)

1B Pgph new paragraph

1C Capz capitalize

1D NoWds no word break

1E Emit emit ToS

1F Gloss gloss from ToS

20-27 + - * / % & | ^ integer arith/bit operators: + - * / % & | ^

28 Decz decimalize (not yet implemented)

29 Catn catenate

2A-2F < > = >= <= != compare: < >= <= > = !=

30 Len length(s), s on ToS

31 Offs offset(x,s) returns the offset of x in s, or -1 if not contained

32 Subst substring(i,n,s) is char i through length n of s

33 Replc replace(x,i,n,s) put x into char i through length n of s

34 ItmNo itemno(x,s) returns the item number if x is an item in s

35 DelItm delitem(n,s) returns s with item n deleted

36 Item item(n,s) returns item n of s

37 CntItm CountItems(s)

38 SubTr GetSubTree, same as Tree,3

39 NxtNo GetNextNode, same as Tree,2

3A PutItm putitem(x,n,s) puts x into item n of s, leave result on ToS

3B Nouns extract NounList from reference tree in ToS

3C LNinTr true if ToS-1 tree contains ToS L&N#; 0 tests if it's a tree

3D Bref extract ToS Bref, if any, as 3-item string of numbers B,C,V

3E NouNo extract ToS noun ref, if any, as integer

3F UpNo Get parent tree, same as Tree,1

40 CkTbVrs (GrammWhiz) Set all CheckTable vars from ToS

41 LookTab Replace ToS table ID with its lookup value

42 DWIM "Do What I Mean" = Emit or AnoLst or CallLN

43 GetLN Recover concept from tree node, as integer d*1000+c

44 NxTrLs Extract Next item from (string) Tree node List

45 TrLsApd Append Tree node onto List

4D,xx xSto Pop index off stack, add it-1 to xx and store ToS-1 in that var

4E,xx xLd Pop index off stack, add it-1 to xx and push that var

4F,xx xRng If ToS<1 or >xx replace it with 0, else dupe ToS

These codes are compiled from grammar rules and stored in a resource type Code in the file named for the target language, and indexed in CodX resources. Variables ("attributes" in Pittman&Peters) are accessed by reference number, with dynamic binding to nearest (most recent) NuVar definition on the current call stack. The names of the variables are stored in Vars resource(s), but not directly accessed except when displaying the state of the translation engine while debugging a grammar.
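To give a feel for how such byte codes execute, here is a toy Python interpreter for a handful of the opcodes in the table (Stop, push integer, Swap, Pop, Dupe, and the integer operators). It is a sketch and not the BibleTrans engine: it assumes a one-byte unsigned operand after opcode 0F, that codes 20-27 follow the order + - * / % & | ^ given in the description column, that the left operand is the one below ToS, and that division rounds the way Python's floor division does.

```python
import operator

# Opcodes 20-27, assumed to be in the order listed in the table's description.
BINARY_OPS = {
    0x20: operator.add,
    0x21: operator.sub,
    0x22: operator.mul,
    0x23: operator.floordiv,
    0x24: operator.mod,
    0x25: operator.and_,
    0x26: operator.or_,
    0x27: operator.xor,
}


def run(code):
    """Execute a byte string of opcodes and return the final stack (ToS last)."""
    stack = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        pc += 1
        if op == 0x04:                      # Stop
            break
        elif op == 0x0F:                    # push integer nn (one-byte operand assumed)
            stack.append(code[pc])
            pc += 1
        elif op == 0x15:                    # Swap the top two entries
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif op == 0x16:                    # Pop
            stack.pop()
        elif op == 0x17:                    # Dupe
            stack.append(stack[-1])
        elif op in BINARY_OPS:              # ToS-1 (op) ToS
            right = stack.pop()
            left = stack.pop()
            stack.append(BINARY_OPS[op](left, right))
        else:
            raise ValueError(f"opcode {op:02X} is not handled in this sketch")
    return stack


print(run(bytes([0x0F, 6, 0x0F, 7, 0x22, 0x04])))          # [42]
print(run(bytes([0x0F, 10, 0x0F, 3, 0x15, 0x21, 0x04])))   # [-7]
```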

The run-time stack consists of tagged data: each entry has, in addition to the halfword variable reference number, a halfword Tag and a 32-bit integer or pointer Content:

Tag Name Content

- frame

0 Nul null

1 Num integer

2 Str string ptr

3 Tre Tree ref

+ (see notes)

Small strings can be optimized by encoding them with a length byte up to 5 characters right on the stack, instead of allocating dynamic memory for them. The stack frame (opened by each rule/function call, and closed on exit) has a negative tag, the offset to the previous frame; the Content field is a back-link to the Code (and offset) making the call.
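The frame back-links and the dynamic binding of variables can be pictured with a small Python model of the stack. This is an illustration only: the class and method names are invented, entries are Python lists rather than packed halfwords, frame entries are given reference number 0 (assuming real variables never use 0), and tree references and the small-string optimization are left out.

```python
NUL, NUM, STR, TRE = 0, 1, 2, 3   # tags from the table above (TRE unused here)


class RunStack:
    def __init__(self):
        self.entries = []   # each entry is [var_ref, tag, content]
        self.frame = -1     # index of the current frame entry, -1 if none

    def open_frame(self, return_link):
        """Push a frame entry whose negative tag is the offset to the previous frame."""
        here = len(self.entries)
        self.entries.append([0, -(here - self.frame), return_link])
        self.frame = here

    def close_frame(self):
        """Discard the current frame and everything above it; return its back-link."""
        _, tag, return_link = self.entries[self.frame]
        previous = self.frame + tag          # tag is negative
        del self.entries[self.frame:]
        self.frame = previous
        return return_link

    @staticmethod
    def _tag(value):
        if value is None:
            return NUL
        return NUM if isinstance(value, int) else STR

    def new_var(self, var_ref, value):       # like NuVar
        self.entries.append([var_ref, self._tag(value), value])

    def _find(self, var_ref):
        """Dynamic binding: the most recent definition anywhere on the stack wins."""
        for entry in reversed(self.entries):
            if entry[1] >= 0 and entry[0] == var_ref:
                return entry
        raise KeyError(var_ref)

    def load(self, var_ref):                 # like Ld
        return self._find(var_ref)[2]

    def store(self, var_ref, value):         # like Sto
        entry = self._find(var_ref)
        entry[1], entry[2] = self._tag(value), value


stack = RunStack()
stack.open_frame("main")
stack.new_var(7, 1)
stack.open_frame("rule A, offset 12")
print(stack.load(7))          # 1, found in the caller's frame
stack.new_var(7, "x")
print(stack.load(7))          # x, the nearer definition hides the outer one
print(stack.close_frame())    # rule A, offset 12
print(stack.load(7))          # 1 again
```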

See the Virtual Machine document for additional details.

2006 May 10 -- Large Base Resources. After putting together the text of the whole Greek NT and the whole Louw&Nida lexicon, I realized that one 4K resource of commonly used words is inadequate. I can do that, but I get too many copies of words replicated into separate resources for each episode, resulting in too many resources to fit into the 16-bit index numbering system. Increasing the base resource size to 16K bytes makes far fewer episode-specific resource items. I did this without altering the basic 4K resource limit by cutting the base resource into two or more 4K chunks at the other end of the numbering spectrum, with a one-word place-holder linked to them. These get reassembled once when the program starts up, so the overhead penalty is minimal.

I still had too many episode-specific resources, so I took the high bits of the episode number as an address extender, which is concatenated onto the 6-bit resource field to form the actual resource number.

Both of these encodings apply to the following resources: GloW, GloX, GrkW, GrkX, ILGW, WrdS, WrdX, (and possibly L&NS).

2006 March 24 -- Interlinear Greek New Testament (public version). This explanation has been moved to the BibleTrans Download Page.

2006 February 22 -- External Document File. Now that I'm actually moving away from HyperCard and have some experience in making this new program work, I'm ready to define a preliminary XML-like format for maintaining the source document files. I'm not an expert in XML, so this almost certainly isn't standard-conformant, but it's a working start.

We have two kinds of tags, those that surround other data, which are of the form <tag>...</tag>, and those that stand alone, of the form <tag/>. Thus every tag syntactically defines its own end, by the "/" character within the angle-brackets. Because angle-brackets are meta-data, I also adopted the common character name "&lt;" for left-angle-bracket, which further necessitates a character name "&amp;" for the ampersand. No other special character escapes appear necessary at this time. I notice that Netscape-generated web pages also use a named character for actual (data) spaces adjacent to tags, but I do not believe it necessary. For now, I will assume that any whitespace outside a tag counts as a single displayed space.
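The two character names and the whitespace rule are simple enough to sketch in a few lines of Python. The function names are invented, and the spellings "&lt;" and "&amp;" are assumed to follow the usual web-page convention.

```python
import re


def escape_text(text):
    """Replace the two meta-characters with their names (ampersand first)."""
    return text.replace("&", "&amp;").replace("<", "&lt;")


def unescape_text(text):
    """Undo escape_text (ampersand last, so "&amp;lt;" comes back as "&lt;")."""
    return text.replace("&lt;", "<").replace("&amp;", "&")


def collapse_whitespace(text):
    """Treat any run of whitespace outside a tag as one displayed space."""
    return re.sub(r"\s+", " ", text)


print(escape_text("a < b & c"))                  # a &lt; b &amp; c
print(unescape_text("a &lt; b &amp; c"))         # a < b & c
print(collapse_whitespace("one \n\t two   three"))   # one two three
```

Escaping the ampersand first matters: done in the other order, the "&" in a freshly written "&lt;" would itself be escaped.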

After looking at the unreadable mess it made without them, I decided to leave my high-level node descriptors in the picture data (marked with asterisks in the table below). The document processing software converts these to lines and icons and text in the displayed image.

For tags, I used those I found in web pages when they make sense, and invented something different (capitalized tag name, which appears otherwise unused) when the standard tags didn't seem usable.

<DocX ID=nn>..</DocX> Delimits the text body for one document page

<Target=name/> Defines the name used for linking to this document page or reference

<ref=name>..</ref> Links to another document, and defines the text to click on for linking

<Icon=name/> Links to a named (or numbered) Icon image, displayed inline

<Icon ID=nn name=name>..</Icon> Defines a named Icon image and delimits the pixel data for it

<Strn ID=nn size=ww/> Links to a Strn resource string, with optional text width

<title>..</title> Delimits the text used in the window title bar

<Drop>..</Drop> Delimits the text displayed in "drop-cap" mode

<sup>..</sup> Delimits the text displayed in smaller font as a superscript

<Text>..</Text> Delimits the text displayed in the default font (normally omitted)

<Face>..</Face> Delimits the text displayed in other defined fonts

<p/> Start a new paragraph

<Tab/> Indent, or leave some blank space within a paragraph

<img height=hh width=ww align=aa>..</img> Defines an embedded image and delimits the data for it

<Memo>..</Memo> Delimits some explanatory text that is not displayed

<Node ID=ii Icon=nn col=cc>..</Node> * Defines a tree node element inside an image, and encloses the associated slot label items

<Slot ID=ii>..</Slot> * Delimits the text of a tree node slot label, linked to another node

<Link ID=ii/> * Defines a secondary link (as a vertical bar) for a multi-tree slot

<LocVH=vv,hh/> Display next image element at coordinate [vv,hh]

<LineTo=vv,hh/> Draw a line from current position to coordinate [vv,hh]

<RectHW=vv,hh/> Draw a rectangle with its top/left corner at current position, with given height and width

<Color=rr,gg,bb/> Set the current drawing color to given RGB values (0-5 each)

<Ipix>..</Ipix> Delimits the numbers representing columns of pixels in the current color, used in drawing icons

Defined Fonts. Documents have a choice of nine predefined fonts, the default Text resembling 10pt Times Roman. The complete list, with their 4-bit encoding in DocX resources:

Mono 0 -- 10pt monospaced font
Text 1 -- 10pt plain
Ital 2 -- 10pt italic
Tiny 3 -- 6pt suitable for superscripts and small labels
Bold 4 -- 10pt sans-serif bold, suitable for subheads
Head 5 -- 16pt sans-serif bold, suitable for headlines and drop-caps
Grek 6 -- 10pt Greek font, resembling Symbol
Nano 7 -- 6pt monospaced font
Hebr F -- 10pt Hebrew font (not yet implemented)

2005 August 20 -- Tree Construction. The easy way to build trees is to pick the Louw&Nida concepts from the tagged Greek text or a palette of common concepts, and drag the number over to the tree construction window and drop it in place -- and then have the software fill in all the defaults automatically. For example, the basic narrative structure is the Proposition, where each proposition is defined by its action or state (verb), and then given some number of obligatory semantic roles and optional modifiers. Select a verbal concept and drop it onto a propositional site, and BibleTrans builds the full prop around it. This is particularly important in the case of verbs, because each verb has its own propositional shape. We already verify that the shape is consistent; this extends the same logic to build consistent trees initially.

The other fundamental structure is the (substantive) Thing. Again, select some nominal concept and drop it into a substantive slot and the software builds a Thing node over it. Drop the same concept onto the ThingList node, and it builds a new numbered DefNode instead. Grab the DefNode and drop it onto the substantive slot and it makes a properly linked Thing. Drop the same DefNode onto an existing populated Thing and it inserts only those parts that are still missing, noun number or substantive L&N concept. Most of the ABP Proposition and Thing modifiers are arranged in groups of related and mutually exclusive concepts; the software can be smart enough to notice if you drop one of them onto a Thing or prop which already has a modifier from that group, and to replace the existing concept with the new one.

You can also simply grab an existing tree fragment and drag it to another empty slot in the growing tree -- or onto another window. Hold down the control key and it clones a copy. Drag a tree onto a receptive text window, and it exports the tree into the canonical Allman-Beale-Pittman ontology text format. Drag a valid text representation onto a tree site and the tree is imported. A whole window, episode, or book can be imported or exported by choosing the appropriate menu item.

Dropping a tree or L&N concept number onto an empty slot or on an existing tree node makes it a new subtree linked at that slot position (possibly pushing previous contents down to make space for it). Dropping a tree on the existing tree icon inserts the new subtree at the end of its subtree list, unless the existing icon is an empty placeholder (tombstone), which the drop will simply replace. There is only one slot label for modifiers after the obligatory semantic roles, so to rearrange multiple modifier subtrees requires dragging them to the head icon in sequence, which deletes them from their original position and reinserts them at the end. Dropping a L&N concept number on an existing tree icon may create a new subtree or just relabel the existing tree, whichever makes more sense.

As the tree is constructed, it is dynamically verified and tree nodes may be shown hollow or solid as the consistency checks respectively fail or succeed. At any time you can right-click a hollow node to determine why it failed the check.

You can also drop a Bible reference onto a node, but it will complain if this verse does not belong to the episode. Finally, you can drop an arbitrary text comment onto any node, which can be used to annotate the tree as to exegetical decisions or difficulties, etc. These items can also be dragged to other tree nodes.

Tree nodes can also be copied to the system clipboard, or pasted in place from it.

2005 August 15 -- Tree Verify: An important part of building semantic trees is enforcing a structural consistency, which makes mechanical translation feasible. Here is a list of the tests to be performed:

1. Circular tree corruption. This shouldn't happen unless the program is failing.
2. Excess subtrees. Some nodes take any number of subtrees; others require only those specified.
3. Missing required subtree(s). Each node shape defines a specific number of required subtrees.
4. Verb must be first under Proposition or Abstract.
5. Semantic role slot must match subtree.
6. Name must be a noun (blue diamond).
7. Proposition modifiers must be adverbs, subordinating relations, and/or incidentals.
8. Proposition modifier conflict. See documentation 0.121-0.207 for mutual exclusions.
9. ThingList reference under a Proposition must be to a verb in the ThingList.
10. Noun or verb must be first under DefNode.
11. Noun or conjunction must be first under Thing.
12. Things require a ThingList reference.
13. ThingList reference under a Thing must be to a noun in the ThingList.
14. Thing modifiers must be adjectives, adpositions, relative propositions, and/or incidentals.
15. Thing modifier conflict. See documentation 0.221-0.284 for mutual exclusions.
16. Propositional Thing modifiers must have a Proposition or coordinating relation as their subtree.
17. Conjunctions can only have Thing subtrees.
18. Inappropriate subtree under semantic role slot. See documentation for correct form.
19. No node shape specified (tombstone icon).
20. Semantic roles take a Thing subtree.
21. A Content semantic role must be a proposition or relation.
22. Inappropriate star (ABP) Thing modifier.
23. Inappropriate star (ABP) Proposition modifier.
24. Incorrect form for Name proposition.
25. Incorrect form for locative proposition.
26. Missing or conflicting age order for sibling relation.
27. Noun (blue diamond) required in first slot of kinship and social roles.
28. Kinship Whose slot must be a Thing.
29. Relations must have other relations or Proposition(s) as subtrees.
30. Non-transitive event role must not contain Thing. See documentation 0.101-0.120.
31. Non-transitive event role must contain Thing. See documentation 0.101-0.120.
32. This should not be linked to a DefNode in the ThingList.
33. DefNode for this noun is different plurality.
34. Appositive must be same ThingList reference as parent Thing.
35. Only DefNodes allowed in ThingList.
36. Exclusivity is incompatible with plurality.
37. There is no relativizing Thing in this modifier proposition.
38. Coordinating relation must not be under a proposition.
39. Subordinating relation must be under a proposition.
40. Incorrect Procedure/Step combo.
41. Thing required as author/audience.
42. Comparison requires a Thing/Prop modifier as a quality.
43. Comparison quality must occur in both propositions.
44. Abstract, Dative, and Genitive can only occur under 0.300 Surface.
45. We don't have this proposition shape.
46. DirectQuote permitted only under Content slot near Destination slot.

This list will grow over time as more consistency issues get formalized.
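As an illustration of the simplest of these tests (#1, circular tree corruption), here is a minimal sketch that walks the parent links upward from a node and reports whether any node is visited twice. The dictionary representation and the name has_parent_cycle are assumptions made for the example only; they are not the program's actual data structures.

```python
def has_parent_cycle(parent_of, node):
    """Return True if following parent links from node revisits any node.

    parent_of maps a node number to its parent's node number;
    0 (or a missing entry) is taken to mean "no parent" (an assumption).
    """
    seen = set()
    while node:
        if node in seen:
            return True
        seen.add(node)
        node = parent_of.get(node, 0)
    return False
```

A healthy tree always reaches a root, so the walk ends; a corrupted one loops back on itself and is caught on the second visit.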

2005 August 12 -- Semantic Tree Data: I think we can graphically display the trees using an embedded image, much as we do for document types. Two differences: The program needs to update the tree image dynamically as the user traverses and edits it, and the image needs horizontal scrolling as well as vertical (text documents can be word-wrapped to fit the window width).

The unformatted tree data is a hyper-linked jumble of nodes, each with a node type (represented by one of seven icons) and some flags to retain whether the node is open (subtrees displayed) or closed and if the program is able to detect missing or inappropriately linked subtrees. We also need links to parent, child, and sibling nodes, a Louw&Nida concept number (which comes with an English gloss and node type information), a noun reference number, and optional Bible reference verse and additional comments. This is packed into 16 bytes on the Mac prototype program, but I think we can make it simpler and more robust by expanding the links to full 32-bit integers and allowing the noun# to coexist with the reference verse and comments. The result is eight 32-bit integers per node, allowing up to 120 nodes in a 1K resource. One of the larger episodes in Philippians has 5736 nodes, so either we must now enable very large resources, or else split the episodes up into multiple resources. I'm going to try the latter and see how it goes. With a likely 10,000,000 nodes for the New Testament, a single file would require more Tree resources than we have allowed for. The simple solution is to divide up the corpus into twelve files, 13-29 chapters each: Matt, Mark, Luke, John, Acts, Rom, 1Co+2Co, Ga+Ep+Ph+Co, Th+Tm+Tt+Pm, Heb, Ja+Pe+123Jn+Ju, Rev.

TreeNodes are 8 integers each and come packed 127 in each Tree resource with an 8-word header. The header info:

0 -- ID# of first node in this resource
1 -- Total number of nodes in this resource
2 -- Number of unused nodes
3 -- Link to unused node list
4 -- Node# of last Tree resource in this episode
5 -- Bible verse reference of episode head
6 -- Episode #
7 -- Node# of episode (or book) root

The nodes themselves are in this form:

+0 -- Node type and flags..
+1 -- Link to parent
+2 -- Link to next sibling
+3 -- Link to first child
+4 -- L&N or ABP concept #
+5 -- Bible verse reference
+6 -- Noun reference # or link, or ID of Titl if this is an Episode or Section head node.
+7 -- Note resource ID of comment
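The two layouts above can be sketched as follows: a Tree resource is read as 1024 32-bit words, an 8-word header followed by 127 eight-word nodes. The big-endian byte order, the field names, and the function name are assumptions for illustration; only the word positions come from the lists above.

```python
import struct
from collections import namedtuple

WORDS_PER_NODE = 8
NODES_PER_RESOURCE = 127                      # plus one 8-word header
RESOURCE_WORDS = WORDS_PER_NODE * (NODES_PER_RESOURCE + 1)   # 1024 words

# Field names are hypothetical; positions follow the header and node lists.
TreeHeader = namedtuple(
    "TreeHeader",
    "first_node_id node_count unused_count unused_link "
    "last_resource_node verse_ref episode root_node")
TreeNode = namedtuple(
    "TreeNode",
    "type_flags parent next_sibling first_child "
    "concept verse_ref noun_ref note_id")

def parse_tree_resource(data, byte_order=">"):
    """Split one 4096-byte Tree resource into a header and 127 nodes.

    byte_order is an assumption ('>' = big-endian); the log does not say.
    """
    words = struct.unpack("%s%dI" % (byte_order, RESOURCE_WORDS), data)
    header = TreeHeader(*words[:WORDS_PER_NODE])
    nodes = [TreeNode(*words[i:i + WORDS_PER_NODE])
             for i in range(WORDS_PER_NODE, RESOURCE_WORDS, WORDS_PER_NODE)]
    return header, nodes
```

Because the header occupies the slot where node 0 would be, a 7-bit node number addresses the header plus 127 real nodes.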

The low three bits of the node type select one of the following 8 semantic shapes:


0	1	2	3	4	5	6	7

0: Invalid or undefined
1: Conjunction (joining nouns)
2: Noun ("Thing")
3: Adjective ("Thing modifier")
4: Verb ("Action or state")
5: Adverb ("Proposition modifier")
6: Relation (joining propositions)
7: Structure (non-semantic "incidental")

Additional bits have these significances:

3: Incomplete or incorrect tree (shown as hollow node)
4: Additional tree information not shown (shown with "+" inside node icon)
5: Irrecoverable tree corruption (shown with black stain inside node)
6: Subtrees shown
8: Root node in this image
16+: Node shape code
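A minimal sketch of decoding word +0 of a node, assuming the bit numbers in the list are zero-based positions counted from the least significant bit (so bit 3 is 0x08). The dictionary keys and the function name are made up for the example.

```python
KIND_NAMES = ["Invalid", "Conjunction", "Noun", "Adjective",
              "Verb", "Adverb", "Relation", "Structure"]

def decode_node_type(word):
    """Split the node type word into the parts listed above."""
    return {
        "kind": KIND_NAMES[word & 0x7],         # low three bits
        "hollow": bool(word & (1 << 3)),        # incomplete or incorrect tree
        "more_info": bool(word & (1 << 4)),     # "+" shown inside the icon
        "corrupt": bool(word & (1 << 5)),       # irrecoverable corruption
        "open": bool(word & (1 << 6)),          # subtrees shown
        "image_root": bool(word & (1 << 8)),    # root node in this image
        "shape_code": word >> 16,               # bits 16 and up
    }
```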

Every tree node in the entire corpus has a unique node ID, consisting of four 1-byte quantities: a file number (one of the twelve above), a relative episode number in that file, a relative Tree resource number in that episode, and then a node number in that resource. The middle 16 bits are a unique resource number in the file, and the low 7 bits index one of 127 nodes in that resource. Episode numbers are mapped to relative episode numbers in the Teps resources, and in the reverse direction in the Reps and Neps resources.

There are three other bit encodings of node parts, which are also widely used in the resources:

00.BB.CC.VV Bible Reference -- one byte each for book, chapter, and verse, with the high byte zero.
00.00.DD.CC Louw&Nida Reference -- 7 bits of domain and 9 bits of concept within the domain (domain 93 has more than 600 concepts, so it spills over into non-existing domain 94), all packed into the low 16 bits of an integer. ABP concepts are domain 0.
0F.EE.RR.NN Tree Node Reference -- one byte each for file (see above), relative episode in that file, tree resource number in that episode, and node in that resource.
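The three encodings above can be sketched with a few shifts and masks. The function names are hypothetical; the field widths are those given in the list. Note how an L&N concept number above 511 carries into the next domain number, which is the spill-over described for domain 93.

```python
def pack_bible_ref(book, chapter, verse):
    """00.BB.CC.VV: one byte each for book, chapter, verse."""
    return (book << 16) | (chapter << 8) | verse

def unpack_bible_ref(ref):
    return (ref >> 16) & 0xFF, (ref >> 8) & 0xFF, ref & 0xFF

def pack_ln(domain, concept):
    """7 bits of domain above 9 bits of concept, in the low 16 bits.

    Addition (not OR) lets a concept number past 511 carry into the
    next domain, as with domain 93.
    """
    return (domain << 9) + concept

def unpack_ln(code):
    """Return the raw (domain, concept) fields as stored."""
    return (code >> 9) & 0x7F, code & 0x1FF

def pack_node_ref(file_no, episode, resource, node):
    """0F.EE.RR.NN: file, relative episode, tree resource, node."""
    return (file_no << 24) | (episode << 16) | (resource << 8) | node

def unpack_node_ref(ref):
    return (ref >> 24) & 0xFF, (ref >> 16) & 0xFF, (ref >> 8) & 0xFF, ref & 0xFF
```

For example, pack_bible_ref(40, 1, 1) gives 0x00280101 (Matthew 1:1), and a concept numbered 600 in domain 93 is stored with the raw fields (94, 88).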

Other resources in the tree files:

BkNm -- These are string resources in the main file, the names of the books of the Bible, with ID numbers to match the book numbers.

ChVs -- This resource has the Bible reference (book,chapter,verse) of the last verse in every chapter of the New Testament in the main file, each packed into a single number. It is used to determine when "next verse" really means "next chapter" (or book).

Help -- These are string resources in the main file, the text for so-called "tool tips", and typically given ID numbers to match what they refer to.

Imag -- Each window is formatted as a picture, originally to be stored in the file as an Imag resource, but the images got too long. Image numbers are still created (assigned sequentially), and the last used number is stored in the ImgN resource, but the numbers are not otherwise used.

ImgN -- This is a 1-word number, the highest used Imag number.

ImNm -- The name of this image, used for the window title.

ImTI -- Image Tree Info, generally the tree node rooted in this image, plus the NounList node for the whole book. The relative episode number can be extracted from the node number.

Neps -- For each relative episode in a tree file, this lists in Imag sequence, the root node of that image. The format is similar to the Teps resource, but without the Bible verse structure or reference at the front; the first word instead is a reference to the book noun list.

Note -- These are string resources, one for each comment string in the tree.

NotN -- This is a 1-word number, the highest used Note resource ID.

NShp -- This is a set of resources in the main file indexed to match the L&N index resources (see L&NX below), but containing node shape ShpX and tree gloss Tglo reference numbers. NShp #1 has the tree icons for each concept, up to 8000 in a single resource, four bits each. #2-9 indexes the English gloss for each concept, 1000 concepts in each resource, each with the resource number and offset into a Tglo resource, plus an index into ShpX #1 for the node shape. #20-27 indexes the nominal Greek phrase for each concept, 1000 concepts in each resource with the resource number and offset and a word count into a list of Greek words in #64-70, which indexes the individual Greek words (a phrase might be more than one Greek word for a given concept), with the resource number and offset into the Greek word index, GrkX.

Reps -- This resource in the main file gives the base episode number in each tree file (generally one less than its first book episode), and thus maps the relative episode numbers in that file onto GNT episode numbers. The reverse mapping is in the second word of each Teps resource.

ShpX -- Node Shape index in the main file, each item lists a sequence of Slbl items in order. The index at the front gives offsets to the item lists and a total Imag item size for format placement estimation. Each list item has an offset in the Slbl resource, with its text length and pre-calculated display width.

Slbl -- This resource in the main file gives the text of the slot label words in no particular order; they are indexed in the ShpX resource.

Teps -- For each episode, a Teps resource with that ID lists a tree node number for each verse in that episode. The first two words are the Bible reference (book/chapter/verse) of the first verse in that episode and the file and relative episode number in its respective file, then each successive word is the tree info for that verse in that episode, or else 0 if no tree node has been designated yet. If an episode crosses a chapter boundary, the unused verse positions are filled with zeros. Chapters are incremented in bit 8 so that direct lookup is possible without prior knowledge of the chapter size. The reverse mapping from relative episodes to GNT episode number is in the Reps resources of each file. It is conceivable to have more than one image per verse, but generally there is one image per episode; the complete list of images for an episode is given in its Neps resource.

Tglo -- Tree Glosses, the short English label on each tree node, one for each L&N or ABP concept, sorted by estimated frequency of use so the most common are all in resource #1, then alphabetical.

Titl -- Each resource of this type is a string, same as Titl resources in the main file, and which is used as the window title for the root window of an episode. Additional strings of this resource type are used for section titles.

Tref -- This resource (in the main file) groups the Bible books into files. Words 1 to 66 are a file number, or 0 if not present. Word 0 is the offset to a sequence of 4-character file name fragments, indexed by file number. Thus the first file name word might be the string "Matt" if file 1 contains Matthew. There is no file 0, so that position contains a secondary offset to the list of author names, indexed by book number. Thus if the first word of the Tref resource is 72 and word 72 is 88, then word 128 (88+40, because Matthew is book 40) would be 244, which is the L&N concept in domain 93 for the name MaqqaioV (Matthew). This resource is duplicated in the master resource file and each tree file.
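The double indirection in the Tref example can be sketched as below, treating the resource as a plain list of integers. The function names, and the big-endian packing of the 4-character name fragment used in the usage note, are assumptions.

```python
def tref_file_number(tref, book):
    """Words 1 to 66: the file holding this book, or 0 if not present."""
    return tref[book]

def tref_file_name_word(tref, file_no):
    """Word 0 is the offset to the name fragments, indexed by file number."""
    return tref[tref[0] + file_no]

def tref_author_concept(tref, book):
    """The 'file 0' name slot holds a second offset, to author names by book."""
    return tref[tref[tref[0]] + book]
```

With the numbers from the example (word 0 is 72, word 72 is 88, word 128 is 244), tref_author_concept(tref, 40) returns 244. If the name fragments are stored most significant byte first, tref_file_name_word(tref, 1).to_bytes(4, "big") would yield b"Matt".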

WhyH -- These are string resources in the main file, the text to explain why a tree node is hollow.

2005 June 30 -- Document Resource ID Numbers: We need a consistent numbering scheme for DocX ID numbers:

20000+n -- Sequentially assigned to successive ABP and L&N definitions, one per concept [Note 1].
28000+n -- Sequentially assigned to successive popup notes in ABP and L&N definitions, one each.
29900+n -- Sequentially assigned to ABP and L&N ToCs, for each domain n. 29999 is the master ToC.
30000+n -- Sequentially assigned to successive episodes in the Greek NT, one each episode [Note 2].

Notes:
1. The correspondence mapping between concept numbers and resource IDs is stored in L&NX resources.
2. The mapping between chapter&verse references and resource IDs is stored in Epis and Bref resources.
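A minimal sketch of how a DocX ID could be sorted into these ranges. The function name and the returned labels are made up for the example; IDs below 20000 are treated as other, sequentially numbered documentation.

```python
def classify_docx_id(res_id):
    """Return (kind, n) for a DocX resource ID, per the ranges above."""
    if res_id >= 30000:
        return ("episode", res_id - 30000)
    if res_id == 29999:
        return ("master ToC", None)
    if res_id >= 29900:
        return ("domain ToC", res_id - 29900)
    if res_id >= 28000:
        return ("popup note", res_id - 28000)
    if res_id >= 20000:
        return ("definition", res_id - 20000)
    return ("other documentation", res_id)
```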

2005 June 2 -- Parse Codes: I developed an internal representation for the Greek parse codes (whatever their source) that can be stored in a unique single letter code for each item. It's not particularly mnemonic, but people don't need to see it.

The first letter selects the word class (as capitalized Bold):

Verb
Determiner (article)
adJective
Noun+proNoun
adverB
Conjunction
Preposition
Xother

The first four of these require one or more additional information items:

Mood Indic,Subj,Opt,Imper,Inf,Part JXOECZ

Tense Pres,Imperf,Fut,Aor,Perf,Pluperf WIUBRL

Voice Active,Middle,Passive,Mid/Pass QKTH

Person 1,2,3 123

Number Singular,Plural SP

Gender Masc,Fem,Neuter MFN

Case Nom,Gen,Dat,Acc,Voc YGDAV

2005 May 24 -- Text Codes Revisited: This is in two parts, first the resource types for encoding all this stuff, then a slight renumbering of the text codes previously decided. Louw&Nida (L&N) definitions are organized into domains and subdomains, each with their respective domain (and subdomain) titles. In the prototype Mac version I had an elaborate link to the text of these domain titles, but I think it sufficient to set them up as string resources linked in the text body. This requires a new string resource code in the body text. We also need a new code for the Greek New Testament (GNT) interlinear cell reference. These changes have been retrofitted into the previous documentation below.

Resource Types:

Bref -- This is a set of resources, one for each book and numbered by book number (Matthew is book 40), mapping chapter and verse back to episode number. This is the inverse function of the Epis resource. The first word of each resource is (one less than) the episode number of the first verse in this book, and all the verses are encoded as one byte increments to it, four verses to an integer. The largest number of episodes in any single NT book is about 160 (two gospels), so the offset easily fits in a single byte. No chapter in the GNT has more than 80 verses, so the verse number fits in 7 bits. The highest chapter number (28) fits in 5 bits, for a total of 12 bits, thus fitting in the 4K resource size limit. A few of these decisions will need to be revisited when we get around to doing the Old Testament, but that is some distance off, probably after resources lose their 4K limit.
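A rough sketch of a Bref lookup, under several assumptions that the description leaves open: the 12-bit index is formed with the chapter in the high 5 bits and the verse in the low 7, the byte table starts in the word after the base, and the four bytes in each integer are stored most significant first. The function name is hypothetical.

```python
def bref_episode(words, chapter, verse):
    """Episode number for chapter:verse in one book's Bref resource.

    words[0] is the base (one less than the book's first episode);
    the remaining words hold one-byte increments, four per integer.
    Index layout and byte order are assumptions, not from the log.
    """
    index = (chapter << 7) | verse
    word = words[1 + index // 4]
    shift = 8 * (3 - index % 4)       # most significant byte first
    return words[0] + ((word >> shift) & 0xFF)
```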

DocX -- This is the basic text data resource, one resource for each window's data. GNT data is organized by episode (see Epis, below), L&N data is organized by concept number (see L&NX below), and other documentation is sequentially numbered from 1; see number ranges above. Window titles for each document resource are contained in Titl resources with the same ID number. A document resource consists of a sequence of integer codes (see Codes below) representing the text to be displayed.

Epis -- This is a one-resource list, in DocX ID# order, of the Bible reference numbers (book,chapter,verse) identifying the first verse of each "episode" (group of verses with a section title). The first number in this resource is the starting resource ID# to which the remaining positions are added. Thus the first episode in the GNT (Matthew's genealogy) is coded in position +2 of the resource as 2621697 (book 40, chapter 1, verse 1, coded as hex 0x00280101), and the position (+2) is added to the 30000 value in offset +0 to form the resource ID 30002, which is the DocX resource containing the text of that episode and the Titl string naming that episode. There are fewer than 1000 episodes in the GNT, so the entire index fits into one resource. The inverse function is Bref. A null episode is allocated in the sequence for the start of each book, but there is no corresponding DocX resource.
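The Matthew example can be checked with a short sketch that treats the Epis resource as a list of integers. The function names are hypothetical, and the zero placed at position +1 in the usage note is only a stand-in for the null episode that opens the book; its actual stored value is not given here.

```python
def episode_docx_id(epis, position):
    """DocX (and Titl) resource ID for the episode at this position."""
    return epis[0] + position

def episode_first_verse(epis, position):
    """(book, chapter, verse) of the first verse of that episode."""
    ref = epis[position]
    return (ref >> 16) & 0xFF, (ref >> 8) & 0xFF, ref & 0xFF
```

With epis = [30000, 0, 0x00280101], position 2 gives resource ID 30002 and first verse (40, 1, 1), which is Matthew 1:1.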

GloW -- Each resource of this type is filled with gloss words used in the interlinear GNT, aligned on 4-byte (one integer) boundaries, and indexed in GloX resources. Resource #1 (see Base Resources for encoding details) contains the most frequently used word strings, and the rest are grouped to contain whole episodes to the extent that any gloss is used only in a single episode. This makes the preparation of the window text more efficient, requiring only two (or perhaps at most three) such resources in memory at one time.

GloX -- This is an index of resource ID and offset numbers for each gloss text string in GloW resources. A compact 16-bit numerical index selects one of these entries, which in turn contains the resource ID and offset in the GloW resource, plus the length in both characters and pixels (the latter for ease in measuring word-wrap). Where a single Greek word gets several different glosses, these are grouped together, so that the particular gloss can be selected from the list with a small (one-byte or less) offset. GloX items are packed (from most significant) as [pixel width:10] [byte length:6] [resource:6] [offset:10].
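A minimal sketch of packing and unpacking one GloX item with the field widths just given (10 + 6 + 6 + 10 = 32 bits). The function and key names are made up for the example, and the units of the offset field are left as stored, since they are not stated here.

```python
def pack_glox(pixel_width, byte_length, resource, offset):
    """[pixel width:10] [byte length:6] [resource:6] [offset:10]."""
    return ((pixel_width & 0x3FF) << 22 | (byte_length & 0x3F) << 16
            | (resource & 0x3F) << 10 | (offset & 0x3FF))

def unpack_glox(item):
    """Split a 32-bit GloX item into its four fields."""
    return {
        "pixel_width": (item >> 22) & 0x3FF,
        "byte_length": (item >> 16) & 0x3F,
        "resource": (item >> 10) & 0x3F,
        "offset": item & 0x3FF,
    }
```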

GrkW -- Each resource of this type is filled with Greek words, aligned on 4-byte (one integer) boundaries, and indexed in GrkX resources. Resource #1 (see Base Resources for encoding details) contains the most frequently used word strings, and the rest are grouped to contain whole episodes to the extent that any word form is used only in a single episode. This makes the preparation of the window text more efficient, requiring only two (or perhaps at most three) such resources in memory at one time. The lexical forms ("lemmas") are included in these resources.

GrkX -- This is an index of resource ID and offset numbers for each Greek word in GrkW resources. A compact 16-bit numerical index selects one of these entries, which in turn contains the resource ID and offset in the GrkW resource, plus its length in both characters and pixels (the latter for ease in measuring word-wrap). GrkX items are packed like GloX resources as [pixel width:10] [byte length:6] [unused:6] [offset:10].

Icon -- This is a sequence of integers representing the colors and pixels of that color in a small (32-pixel square or less) icon for display in a text line or within a picture. See Icons below for the specific format.

ILGW -- (InterLinear Greek Word) This is an index of reference numbers for each Greek word in the interlinear text. Each entry consists of four 16-bit reference numbers and four 8-bit width numbers. The first two reference numbers are to the inflected form of the Greek word and its lemma, both indexes in GrkX resource(s), and the next two are the resource offset of the parse code (see Pars below) and the gloss in the GloX resource. If there are multiple glosses for this Greek word, the gloss number is the first of them (the others being consecutive numbers following it). The four widths represent the respective display widths of the four parts (in separate lines); a single number is not possible because the decision of which lines to display can be made by the user at runtime. The fourth width varies according to the selected gloss, so it normally occurs in the second word of the document reference code, along with the L&N tag and its width.

L&NS -- These resources give sequences of possible L&N tags for untagged Greek words in the interlinear text. A single L&N concept number fits easily into 16 bits with code space left over; code numbers greater than 49152 (non-domain 96 or higher) are used to index one of these sequences. These index resources encode the resource number, starting offset, and string length for each sequence, which are listed in the low-numbered resources of the same type.

L&NX -- These resources index the L&N concept numbers. Although not used directly in formatting the document windows, they are programmatically used to translate into DocX resource IDs the L&N numbers embedded in the interlinear GNT. The packed L&N number consists of a 7-bit domain number and a 9-bit concept within that domain (domain 93, Names of Persons and Places, has more than 511 concepts, so the number extends into non-domain 94). The L&NX resource has 94 values (domain 0 is used for the ABP extensions) representing the starting sequence number for the first concept in that domain. Another set of 8 resources gives the reverse mapping, from sequence number to L&N number.

Pars -- This is one or more resources containing the parse code strings for the interlinear GNT, aligned on 4-byte (integer) bounds. If they don't all fit in one resource, the base resource contains the most common codes, and one or more additional resources contains all the additional parse codes for some set of episodes. Another resource of the same type and format consists of one-byte resource ID numbers for string resources (also type Pars) giving spelled-out names for the codes.

Strn -- This resource contains the text of an entire string to be embedded in the text without wrap. It is normally used in a L&N definition panel for domain or other titles.

Targ -- This is a list of hot-link target references, in numerical order by reference number in the DocX resource. These are currently defined to be a 4-byte type code (such as "DocX") and a resource ID and offset packed into a second 4-byte integer. Clicking on a hot link associated with a particular target opens (or brings to front) the respective DocX page, or else sends that message to the window owner for processing. The exact format of the Targ resource is still under consideration.

Titl -- Each resource of this type is a string, which is used as the window title for whatever document is displayed in that window. The resource ID of a Titl resource is the same as its matching DocX. See also Epis for the index of GNT (or other Bible versions) titles.

WrdS -- Each resource of this type is filled with English words, aligned on 4-byte (one integer) boundaries, and indexed in WrdX resources. Resource #1 (see Base Resources for encoding details) contains the most frequently used word strings, and the rest are grouped to contain whole documents to the extent that any word is used only in a single document. This makes the preparation of the window text more efficient, requiring only two (or perhaps at most three) such resources in memory at one time.

WrdX -- This is an index of resource ID and offset numbers for each English word in WrdS resources. A compact 16-bit numerical index in a DocX resource selects one of these entries, which in turn contains the resource ID and offset in the WrdS resource, plus its length in both characters and, for each of a half-dozen supported fonts, pixels (the latter for ease in measuring word-wrap). The first word of WrdX items is packed (from most significant) as ["Head" font pixel width:10] [byte length:6] [resource:6] [offset:10]. The second word contains one-byte pixel widths for "Bold", "Tiny", "Ital", and "Text" fonts.

2005 May 13 -- Realizing that I don't have IP Rights to most of the resources needed to run BibleTrans, I embarked on a two-pronged effort to remedy the situation. First and best is to license the materials from the rights holder(s), but it's hard to find the right person, and harder still to actually work out a license. Perhaps God or one of His other servants can help move this along.

Failing that, and/or while waiting for it to happen, I can use public domain materials and the results of my own (and hired) labors. I searched the internet and found a web site with the 1881 Greek text of Westcott&Hort, with more recent Nestle/Aland changes added in the apparatus; I removed the NA additions to recover the pure WH text, which is in the public domain. For parse tags and lexical forms ("lemma") I compared three independent sources and used the parse information where there was agreement (evidence that the parse is common knowledge rather than proprietary), leaving blank any forms where there was disagreement. We (I and my employee at the time) spent substantial effort designing the node shapes corresponding to the L&N concept numbers; part of that effort included coming up with a one- or few-word English gloss. I can now back-substitute that into the interlinear Greek text as the English gloss on the Greek words; I can also use those glosses in a rudimentary representation of the Louw&Nida lexicon until I can get that licensed. Assigning L&N concept numbers to the Greek text was all done within the BibleTrans project, and therefore I own the rights to those assignments. There was at one time (perhaps still is) another project to tag the GNT with L&N numbers, but my tools were more efficient, so at the time I deemed it more productive to have my own team do the job, and to ignore the other effort.

I subsequently learned that I cannot get a blanket license to IP rights, but only a per-unit royalty license. That means I cannot post licensed material to the internet. So it looks like I will maintain two versions of the documents: one derived entirely from public domain (and my own) documents, for posting and trial use, and the other licensed with up-to-date Greek text and the full Louw&Nida lexicon, for actual use by linguists.

2005 May 9 -- Interlinear Text encoding is slightly more complex than plain linear text. With linear text, each item has the same nominal height (actually the height of the tallest item in the line), and its width determines the spacing to the next item (word) in the line, or when to continue the line onto the next with this item. Pictures are complex items, but for spacing they are treated as atomic (indivisible). In one sense, interlinear text word-cells are atomic blocks, but like the partly-inflected nouns and verbs in plain text words (see below), they are constructed somewhat on the fly from smaller file components. Also, except for icons (tiny pictures that fit within the line height), pictures are placed outside the flow of text wrapping. Interlinear text cells, however, need to participate in the flow.

The first generation (Mac) BibleTrans program had some checkboxes on each Greek text page for the user to select which lines of information should be visible, and the selection applied to the whole episode in that window. Experience suggests that there is not much need to be changing the amount of display on a regular basis, and that a Preferences panel can set defaults, and a contextual popup menu can alter the current page when needed. Thus we can think of the interlinear word cell as a (relatively) fixed height and composition, but (usually) taller than plain words. Each cell represents exactly one Greek (or Hebrew, implicitly every place instead of Greek) word, even though the English gloss for it may consist of several words, and an untagged item may offer several L&N suggestions. The cell word-wraps with the single text word, regardless of the size of its supporting gloss and tag lines, the largest of which determines the physical cell width.

Current thinking is to build a frequency-sorted index of all the fully inflected Greek words, listing in that index not only the spelled-out form, but also its lexical form ("lemma") and parse code, and if appropriate, the English gloss. Like the regular document words, we can include for ease of word-wrap decisions, the pixel width of this cell. L&N tags are just numbers, but (in the case of context-sensitive glosses) the English gloss phrases require a separate index, which can be embedded by default in the word cell item and/or included in the text along with the L&N tag number and the cell index number. Even in the case of context-sensitive glosses, the number of different glosses for each Greek word is probably under a dozen.

There are slightly less than 20,000 different inflected Greek words in the New Testament; this number can easily fit into a standard 16-bit integer. A second 16-bit (or smaller) number indexes the lexical form of the word from the same list, and another can index all possible parse codes. There are just under 8000 L&N concept numbers, again indexable in less than 16 bits. Alternatively, the parse codes and L&N numbers can be fully represented in one 32-bit word each. If we use the node shape gloss linked in BibleTrans to the L&N concept number, then we need no additional bits to encode the English gloss in the Greek interlinear text cell, resulting in a minimal cell storage of 64 bits.

I decided to put the L&N tag in the body of the text, so it can be individually set for each Greek word. The remaining 16-bit slot in the cell is used to index a sequence of one or more possible English gloss words for this Greek word, the actual selection packed into the body text. Thus the same Greek word can be glossed in any of several ways if needed, on the assumption that the number of different glosses for any given Greek word form is relatively small. Adding precomputed pixel widths for each part gives us the following 3-word ILGW cell format:

GkwdLemm	Gkwd: index to inflected Greek word (GrkX)
	Lemm: index to lexical form, also in GrkX

ParsGlos	Pars: index to parse code (Pars)
	Glos: index to set of glosses (GloX)

0cPwLwIw	0c: additional pixel width if caps
	Pw: parse code byte length
	Lw: lemma pixel width
	Iw: inflected word pixel width

Like the other word indexes, this is sorted by frequency, so the most common items are contained in a single Resource #1 (see Base Resources for encoding details), and the additional items are collected in a single resource for each text episode. An episode represents a translation unit, typically a story or topic, typically determined by a section heading in the Greek text. In the case of the unedited public domain text, paragraph breaks from the (1901) ASV are used.
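
As a rough illustration of the 3-word ILGW cell layout above, the following Python sketch packs and unpacks one cell. The names `ILGWCell`, `pack_cell`, and `unpack_cell` are hypothetical. It assumes each word is 32 bits with the first-named field in the high-order position, and it treats "0c" as an ordinary one-byte field.

```python
from typing import NamedTuple, Tuple


class ILGWCell(NamedTuple):
    """One interlinear Greek word cell (field names follow the layout above)."""
    gkwd: int        # index to inflected Greek word
    lemm: int        # index to lexical form
    pars: int        # index to parse code
    glos: int        # index to set of glosses
    caps_extra: int  # "0c": additional pixel width if capitalized
    pw: int          # "Pw" byte, as described in the layout
    lw: int          # lemma pixel width
    iw: int          # inflected word pixel width


def pack_cell(cell: ILGWCell) -> Tuple[int, int, int]:
    """Pack a cell into three 32-bit words (assumed high-to-low field order)."""
    for value in (cell.gkwd, cell.lemm, cell.pars, cell.glos):
        if not 0 <= value <= 0xFFFF:
            raise ValueError("index fields must fit in 16 bits")
    for value in (cell.caps_extra, cell.pw, cell.lw, cell.iw):
        if not 0 <= value <= 0xFF:
            raise ValueError("width fields must fit in 8 bits")
    word0 = (cell.gkwd << 16) | cell.lemm
    word1 = (cell.pars << 16) | cell.glos
    word2 = (cell.caps_extra << 24) | (cell.pw << 16) | (cell.lw << 8) | cell.iw
    return word0, word1, word2


def unpack_cell(words: Tuple[int, int, int]) -> ILGWCell:
    """Reverse of pack_cell."""
    word0, word1, word2 = words
    return ILGWCell(
        gkwd=word0 >> 16, lemm=word0 & 0xFFFF,
        pars=word1 >> 16, glos=word1 & 0xFFFF,
        caps_extra=word2 >> 24, pw=(word2 >> 16) & 0xFF,
        lw=(word2 >> 8) & 0xFF, iw=word2 & 0xFF,
    )
```

Because every field sits on a byte or half-word boundary, a layout routine could read the widths with plain shifts and masks and never touch the word lists.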

The interlinear cell reference in the document text is packed into four 16-bit half-words: The first word is encoded similarly to plain English text in the upper bits, with capitalization, punctuation, and a 6-bit gloss sub-index in the upper half, over the cell index number in the low 16 bits. The second word packs two one-byte widths (for the L&N tag and the gloss) above a packed L&N tag. Read the following packed data as numbers; on a Little-Endian machine like the PC, the bytes of each word will be stored in memory in the reverse order:

4pgaCCCC	p: 2-bit punctuation
	g: 6-bit gloss sub-index
	a: 1-bit capitalization
	C: 16-bit cell index, see also ILGW

nneeLLLL	n: width of L&N tag
	e: width of gloss
	L: L&N tag (see L&NS)

For a partial explanation of how some of these structures are constructed, see L&N Tags in "Document Preparation".
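
The remark about byte order can be made concrete with the second reference word (nneeLLLL). This is only a sketch: the function names are hypothetical, and the field positions are read directly off the hex picture above (one byte each for the two widths, 16 bits for the L&N tag).

```python
import struct


def pack_tag_word(tag_width: int, gloss_width: int, ln_tag: int) -> int:
    """Build the nneeLLLL word: two one-byte widths above a 16-bit L&N tag."""
    if not (0 <= tag_width <= 0xFF and 0 <= gloss_width <= 0xFF):
        raise ValueError("widths must fit in one byte")
    if not 0 <= ln_tag <= 0xFFFF:
        raise ValueError("L&N tag must fit in 16 bits")
    return (tag_width << 24) | (gloss_width << 16) | ln_tag


def unpack_tag_word(word: int):
    """Return (tag_width, gloss_width, ln_tag) from an nneeLLLL word."""
    return word >> 24, (word >> 16) & 0xFF, word & 0xFFFF


def memory_bytes(word: int, little_endian: bool = True) -> bytes:
    """Show how the same 32-bit number is laid out in memory."""
    return struct.pack("<I" if little_endian else ">I", word)
```

For example, the number 0x12345678 is stored as the bytes 78 56 34 12 on a Little-Endian machine and as 12 34 56 78 on a Big-Endian one, which is why the pictures above should be read as numbers rather than as byte sequences.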

2005 May 5 -- Picture Encoding consists of a sequence of coded integer descriptors following a 2-word header. The first word of the header contains an integer length of the picture data, plus some flag bits to specify how the picture fits into the text (left- or right-flow, or else centered in its own paragraph). The second word gives the size of the image rectangle as (height, width), 16 bits each. A related (unused) first header word could specify the resource ID where the image is to be found; the rectangle word gives the dimensions so that page layout need not fetch the resource solely to get its size. Except in the case of the resource link, the image data follows the header, and is terminated by a word of all zero. The same format (without the first header word) is used to encode Imag image resource data. This format is optimized for so-called vector graphics (straight lines and filled rectangles, with occasional pieces of text and small icons) rather than photographic images.

The following types of coded data can be used in the image (see also text document forms), where vertical components are 16 bits, and horizontal components are 12 bits:

0,h,w Filled rectangle height h, width w

1,n,s,c Text font style s, color c, for n characters following

2,v,h Jump to coordinate vertical v, horizontal h

3,v,h Draw line to coordinate vertical v, horizontal h

4,i Icon, id i, usually a 4-character name

5 Icon

6,n,c Pixel bits, color c, for n column words following

7,h,w Pixel bytes following, height h, width w

Colors in this image are specified using a 1-byte 6x6x6 RGB encoding, where black is the numeric value 0, and white is the numeric value 215, and the red value ranges from 0 to 5. Minimal green value is 6 and minimal blue value is 36. Byte values above 215 are not valid colors in this model, but 255 may be used in a byte run to leave the pixel unchanged. This is the same model used in the IttyBittyStackMachine architecture. Use a text item of zero length to set the color for a subsequent rectangle or line.
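
The numbers given above (red steps of 1, green steps of 6, blue steps of 36, white at 215) are consistent with the formula value = red + 6*green + 36*blue. A minimal sketch follows, with hypothetical function names. The scaling of each 0..5 level to 0..255 in `to_rgb24` is my own assumption for display purposes, not something this log specifies.

```python
def encode_color(red: int, green: int, blue: int) -> int:
    """Combine three levels (each 0..5) into one 6x6x6 color byte."""
    for level in (red, green, blue):
        if not 0 <= level <= 5:
            raise ValueError("each component must be in 0..5")
    return red + 6 * green + 36 * blue


def decode_color(value: int):
    """Split a color byte back into (red, green, blue) levels."""
    if not 0 <= value <= 215:
        raise ValueError("values above 215 are not colors in this model")
    blue, rest = divmod(value, 36)
    green, red = divmod(rest, 6)
    return red, green, blue


def to_rgb24(value: int):
    """Assumed linear scaling of each level to 0..255 (5 * 51 == 255)."""
    red, green, blue = decode_color(value)
    return red * 51, green * 51, blue * 51
```

A renderer would handle the byte 255 (leave the pixel unchanged) before calling `decode_color`, since that value is deliberately outside the color range.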

Text characters and pixel bytes are in file-byte order, without respect to whether the host hardware is Big-Endian or Little-Endian. The integer-coded data use the natural representation for integers in the host hardware.

Pixel bits are encoded in 32-pixel vertical slices, one slice per integer word, with the least significant bit at the top. This is the same encoding used in the IttyBittyStackMachine text, same as icons (below).

Icons are encoded using a sequence of integers representing the pixels in a single column up to 32 pixels tall, from left to right. This encoding was originally defined for rapid text display in the IttyBittyStackMachine architecture, where it is fully described. Icons are a little more complex than plain text, so we encode them as a header word defining the size of the bounds rectangle, followed by one or more pixel planes, each consisting of a color+size word followed by the pixel data. The least significant byte of the color+size word is the color, and the next byte up is the number of pixel columns in this color plane. The top two bytes of this word should be 0; they are reserved for an optional pixel offset, down and to the right. The final word of the icon data is all zero, which also decodes as a 0-pixel plane of black.
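
The icon layout just described can be sketched as a small decoder. The names here are hypothetical. The header layout (height in the upper 16 bits, width in the lower 16) is an assumption borrowed from the picture header, and the reserved offset in the top two bytes of each plane word is ignored.

```python
from typing import List, Optional, Sequence


def column_pixels(word: int, height: int = 32) -> List[int]:
    """Bits of one vertical slice, top pixel first (least significant bit at top)."""
    return [(word >> row) & 1 for row in range(height)]


def decode_icon(words: Sequence[int]):
    """Decode icon words into (height, width, pixel grid).

    Each grid entry is a color byte, or None where no plane set the pixel.
    """
    height, width = words[0] >> 16, words[0] & 0xFFFF  # assumed header layout
    grid: List[List[Optional[int]]] = [[None] * width for _ in range(height)]
    pos = 1
    while pos < len(words) and words[pos] != 0:  # an all-zero word ends the icon
        plane = words[pos]
        color = plane & 0xFF          # least significant byte is the color
        columns = (plane >> 8) & 0xFF  # next byte up is the column count
        pos += 1
        for col in range(columns):
            bits = column_pixels(words[pos + col], min(height, 32))
            for row, bit in enumerate(bits):
                if bit and col < width:
                    grid[row][col] = color
        pos += columns
    return height, width, grid
```

Later planes overwrite earlier ones in this sketch, which is one plausible way to layer colors; the log does not say how overlapping planes combine.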

In most cases icons are used repeatedly, both in the text and in various parts of a larger image, so we store each icon in its own resource of type "Icon" (see Resources below).

2005 April 26 -- There is a small number of Text Types in document windows that we need to support, plus a comparably small number of image types:

a. Plain text, for which we use plain black 12pt Times Roman, or a comparable font.
b. Bible verse references, hyperlinked to the Greek or one of the supported Bible translations, purple?
c. Hyperlinks to other document pages, the same font, but blue underlined.
d. Action hyperlinks, generally a different color, like green or red.
e. Local Emphasis, using 12pt Times italic.
f. Subheads, using 12pt Helvetica ("Arial") Bold.
g. Greek text, for which we adapt the generally available Symbol font, still in 12pt.
h. Some kind of bigger, blacker Headline font for titles, for which we use 18pt Helvetica.
i. Superscripts, because Louw&Nida uses superscript letters to distinguish word senses, 9pt Times.

This program is about the Bible, so Bible reference hyperlinks are strongly connected to whatever Bible support tools we can make available. The prototype Mac version of BibleTrans defaulted on a double-click on a Bible reference to the Greek text, or (if from the Greek) to the semantic tree node, with an alternative contextual menu from which you could also choose any installed translation (I even did a French translation in one demo). Modern web browsers (which did not exist when the Mac prototype was being developed) link on a single click, so that is the standard people expect. Another improvement to hyperlinks is the underline and the ability to pop up the context menu on a click-and-hold or right-click.

The parse codes in the interlinear Greek text use a cryptic notation that a proficient user becomes accustomed to, but it's easy to forget what these cryptic letters stand for; clicking on the green parse code pops up a little window with the codes spelled out. They also appeared in the "Balloon Help" mode, which is something like Windows "tool tips", except you could turn the annoying buggers off on the Mac when you didn't want them getting in the way. I didn't in the Mac prototype, but we might standardize the green hyperlinks to mean that it just pops up a little window.

An important usability feature of BibleTrans is its direct manipulation control, click and drag to do whatever needs doing, rather than typing arcane command lines. Hyperlinks are an important part of this. Opening a related document is just the tip of the iceberg. The action links initiate a subprogram activity, possibly involving many conceptual steps.

In addition to textual content, we have visual images, in-line icons, and larger pictures set off from the text, either in a separate "paragraph" or else with the text flowing around it left or right. Because we use a graphical tool for building the semantic database trees, it's necessary to explain these structures graphically. Anyway, pictures also help to organize ideas.

Two useful ways of entering certain classes of linguistic data are by on-off checkboxes, and by fill-in tables. These work something like pictures for display purposes, but they are linked to user data and hot-linked to specific actions. The Mac prototype also had dynamically generated images for direct manipulation of data, notably by dragging data elements into a particular order, like so:

Another example connects up data elements in a user-specified way:

We still want to do these.

Encoding: We have defined above six different font faces+sizes, which can be encoded easily in three bits with room left over for Hebrew and a small system font like that used in the picture above. A similarly small number of bits can encode the type of hyperlink and maybe a choice of half-dozen colors. Current thinking is to encode the data in the file in these type codes:

0,ff,w Text word w, with punctuation & inflection bits ff; 00 is end of doc

1,f,c Single (or double) character c, with punctuation bits f; 1,0 is paragraph end

2,f,n Decimal number n, with punctuation bits f

3,f,b,c,v Bible reference (book,chapter,verse), with punctuation bits f

4,f,g,n Interlinear text cell (ILGW) n and gloss increment g, with punctuation bits f (2 words)

5,f,s,n Strn resource n, with punctuation bits f, and pixel width s; s=0 is whole line

6,t,s Hot-link reference number t, with color & style bits s; t=0 is font change

7,r,i Icon number i, size r

8+ Picture encoding

9+ Active Image

The first six of these codes contain flag bits for word spacing and punctuation, as follows:

26,25 Punctuation
24 Omit word space
20-23 Suffix codes (English text words only)
16 Capitalized
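
A hedged sketch of how these flag bits might be pulled out of a text word follows. The bit positions come from the list above. The function names are hypothetical, and taking the word index from the low 16 bits is an assumption. The position of the type code itself is not decoded here.

```python
def decode_flags(word: int) -> dict:
    """Extract the spacing and punctuation flags from one 32-bit text item."""
    return {
        "punctuation": (word >> 25) & 0x3,       # bits 26,25
        "omit_space": bool((word >> 24) & 0x1),  # bit 24
        "suffix": (word >> 20) & 0xF,            # bits 20-23
        "capitalized": bool((word >> 16) & 0x1), # bit 16
        "index": word & 0xFFFF,                  # assumed: low 16 bits
    }


def encode_flags(punctuation: int, omit_space: bool, suffix: int,
                 capitalized: bool, index: int) -> int:
    """Inverse of decode_flags, leaving all other bits zero."""
    if not (0 <= punctuation <= 3 and 0 <= suffix <= 15 and 0 <= index <= 0xFFFF):
        raise ValueError("field out of range")
    return ((punctuation << 25) | (int(omit_space) << 24) | (suffix << 20)
            | (int(capitalized) << 16) | index)
```

Keeping all the flags in fixed positions means the same masks work for every one of the flagged item types, whatever the low-order payload happens to be.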

The interlinear Greek text has four parts that are fixed for the particular word being displayed (the textual form of the word, its dictionary spelling or lemma, the parse codes, and an English gloss -- but some interlinear texts vary the gloss by context), plus one line of information added by BibleTrans for analysis, which is the Louw&Nida tag for this word in this context. The L&N tag may be different for the same word in different contexts, so we cannot embed it in the word list. Thus it requires a separate index item in the text proper. When the interlinear text varies the gloss according to the context, we need a separate index item for the gloss also. Therefore code 4 is actually the first of two integers needed to encode all this.

2005 April 15 -- Line Height in document windows could be made completely general, but it turns out we really only have three distinct font sizes: A large title font (Helvetica 18), a small superscript font (Times 9), and everything else (Times 12, plus italic and Courier and Symbol for the Greek, all the same size). The titles generally occur on a line by themselves, but we can have enlarged chapter numerals at the beginning of a paragraph, in drop caps mode. With a more general subject matter it might make sense to offer more variety in textual styles, but this is BibleTrans, not Principia Mathematica.

Therefore we plan to support only one basic line height, suitable for standard text with raised (smaller) superscripts. The title font would drop below the line, and if it occupies more than half of the total line length, the whole line becomes double height; otherwise the right end of the initial title font run temporarily becomes the left margin for the next line to create the drop-cap effect. We plan to use the same temporary left and right margin support to wrap text around pictures embedded in the text, when they are taller than the standard line height. See Defined Fonts.

2005 April 13 -- Word Encoding in Doc Files can be more dense if we extract common suffixes into some upper bit flags. We could do the same with prefixes, but the English language doesn't inflect words very often at the front, except for capitalization at the front of sentences. The suffixes come in two categories:

a. Noun (number and case) and verb (number and tense) inflections, and
b. Sentence punctuation.

Verb inflections include the following common indicators of present and past participle (tense) and third person singular:

1. -e+ing
2. -y+ied
3. -y+ies
4. +s
5. +ing
6. +ed
7. +es -- if the verb ends in a sibilant like 's' or 'x'
8. +d -- if the verb ends in e

Noun inflections include the following common indicators of plural:

3. -y+ies
4. +s
7. +es -- if the noun ends in a sibilant like 's' or 'x'

and the following indicators of possessive (genitive) case:

9. +'s -- singular and some irregular plurals
10. +s' -- plural

The following common punctuations are worth encoding in the final word of their clause:

1. period
2. comma
3. semicolon

Other punctuations occur less often, so the savings from encoding them would be minimal. These three can be encoded in a 2-bit field. The inflections can be encoded in a separate 4-bit field. Another bit encodes the capitalization, for a total of seven bits.
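
To show how the seven bits expand back into a displayed word, here is a minimal sketch using the suffix and punctuation numbers listed above. The table and function names are hypothetical, and treating code 0 as "no suffix" and "no punctuation" is an assumption.

```python
# Suffix code -> (letters to drop from the stem, letters to add).
SUFFIX_RULES = {
    0: ("", ""),      # assumed: no inflection
    1: ("e", "ing"),
    2: ("y", "ied"),
    3: ("y", "ies"),
    4: ("", "s"),
    5: ("", "ing"),
    6: ("", "ed"),
    7: ("", "es"),
    8: ("", "d"),
    9: ("", "'s"),
    10: ("", "s'"),
}

# Punctuation code -> trailing mark (0 assumed to mean none).
PUNCTUATION = {0: "", 1: ".", 2: ",", 3: ";"}


def inflect(stem: str, suffix: int = 0, punctuation: int = 0,
            capitalized: bool = False) -> str:
    """Rebuild the displayed word from its stem and flag fields."""
    drop, add = SUFFIX_RULES[suffix]
    if drop:
        if not stem.endswith(drop):
            raise ValueError(f"stem {stem!r} does not end in {drop!r}")
        stem = stem[: -len(drop)]
    word = stem + add
    if capitalized:
        word = word[:1].upper() + word[1:]
    return word + PUNCTUATION[punctuation]
```

With this scheme "make", "making", "Making," and "makes." all share one word-list entry and differ only in their flag bits.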

2005 April 8 -- Doc Files will be maintained internally in BibleTrans as binary resources (see Resources, below) coded for rapid text wrap and display. Since resource files are not portable between Big-Endian and Little-Endian platforms, we need some kind of doc file import function. It makes sense to use a simplified HTML (perhaps eventually XML) markup format for the external file. Then doc preparation can be done in almost any web authoring tool. This also eases the burden of changing the internal representation, should that need refactoring. We already have a documented external representation (ABP) for tree data.

For rapid reflow when loading or if a window size changes, and for a compact file size, experience shows that the most efficient resource representation of the document text is by indexed words. We build a word list of all words in the document, sorted by frequency, and store these in one or more separate resources. The text itself then consists of a string of numbers indexing the word list. Although not particularly robust against a determined pirate (nothing is), this encoding also protects intellectual property (many of the document files are restricted by copyright) from casual theft by giving it protection under the Cyclopsian* Digital Millennium Copyright Act, which criminalizes otherwise reasonable research. The advantage for processing is that we can measure all the word widths at one time and store them in the table, so determining word wrap is very speedy. Word searches are also fast and simple.
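
The indexed-word idea can be sketched in a few lines. This is an illustration under assumptions, not the BibleTrans implementation: the function names are hypothetical, indexes start at 0, ties in frequency are broken alphabetically, and the pixel widths are supplied by the caller.

```python
from collections import Counter
from typing import Dict, List, Sequence


def build_word_list(words: Sequence[str]) -> List[str]:
    """Return the distinct words, most frequent first."""
    counts = Counter(words)
    return sorted(counts, key=lambda w: (-counts[w], w))


def encode(words: Sequence[str], word_list: Sequence[str]) -> List[int]:
    """Replace each word with its position in the word list."""
    index: Dict[str, int] = {w: i for i, w in enumerate(word_list)}
    return [index[w] for w in words]


def wrap(indices: Sequence[int], widths: Sequence[int],
         space: int, line_width: int) -> List[List[int]]:
    """Greedy word wrap using only the precomputed per-word pixel widths."""
    lines: List[List[int]] = []
    current: List[int] = []
    used = 0
    for i in indices:
        need = widths[i] if not current else used + space + widths[i]
        if current and need > line_width:
            lines.append(current)
            current, used = [i], widths[i]
        else:
            current.append(i)
            used = need
    if current:
        lines.append(current)
    return lines
```

Because `wrap` consults only the width table, a resize needs no font measurement at all, and a word search reduces to scanning the body for one integer.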

Another advantage is that interlinear text tracking is trivial: an interlinear Greek text is stored as the Greek text word indices only, then if the English gloss or the grammatical parse is to be displayed with the Greek word, they occupy the same word cell (only taller), and word-wrapping still works exactly the same. It even works correctly for wrapping right-to-left languages like Hebrew within the body of an otherwise left-to-right English document, as the interlinear glosses track the Hebrew words they are part of. The cut point on a split line is calculated exactly the same for both directions, but the Hebrew words are displayed in reverse order. This gives the startling (but correct) result that when a line starts in English and then switches to Hebrew that extends into the next line, the first Hebrew word sits at the end of the first line, with the following words filling out to the left until they meet the English text; the Hebrew then continues from the middle of the next line to the left, while the English resumes at that same middle point and continues to the right. Partial selections are really weird, but BibleTrans has not needed to support copying document text, so this is not a problem for us.

The body text of a document in its resource should consist of a sequence of runs, each with a header word describing the run length and the type and style of content, followed by the word index numbers in that run. Style is again indexed in a table of (a small number of) standard styles. For hyperlinked items, one or more additional words contain the link index (resource number). Pictures and icons are like text runs, except the header contains different information, either a picture encoding type or a defined icon number.

BibleTrans document pages were purposefully kept relatively short. If this tradition is preserved, then the in-memory storage of a document on display is pretty modest: a pixel map of the formatted page, plus a structured list of the hotlink rectangles for taking action when the user clicks on one of them. The resource data can be dismissed -- and reloaded if the window is resized: because all modern operating systems cache file data, reloading it is pretty quick.

* The 16th-century theologian T.Norton described the Greek mythological creature Cyclops as "warring against God," probably on account of its bellicosity in the literature. The proponents of DMCA certainly fit that description in their aggressive prosecution of its penalties.

2005 April 7 -- Fonts: The BibleTrans doc windows only need four (eventually five, with Hebrew) fonts. The body text we have always done in Times Roman, with Helvetica (Win32 calls it Arial) section heads, embedded Greek using Symbol, and some fixed-pitch items (mostly only the parse codes) using Courier. Win32 apparently has these four fonts in the default system, but it's unclear how to get access to them reliably. Hebrew I will need to supply myself, most simply as a bitmap font I can display as a pixel image. I need to do pixel images for small icons and larger pictures anyway. I can do all the text as pixel images, but the hassle of maintaining the other four fonts as bitmaps seems like a lot of effort for the value. On the other hand, the hassle of mixing native text with graphics combined with the baroque Win32 system calls for font access leans more toward a homebrew solution. Fortunately, I can put all this inside the framework, for easy refactoring if I subsequently decide to revisit the decision. See Defined Fonts (above) for specifics.

2005 March 28 -- Resources: the Mac has them and implemented them well; the PC embeds read-only resources into program files, but because they cannot be updated at runtime, they are not as useful as on the Mac. Resources are too new for the doddering Unix, which stores everything inefficiently in a zillion tiny text files (yes, even Apple's hacked-up unix does that). To get the functionality of the Mac's resource files in BibleTrans, I need to implement them myself. They are generally useful enough to be in the system, but there are some trade-offs.

First, Why a resource file implementation at all? Why not a zillion tiny text files, the unix way? Logically, the two ideas are approximately equivalent, except that resources can be accessed by number as well as name, and can be grouped by type. This is effectively a strong-type way to do things, while the unix files are untyped. Text conversion costs extra time -- not a big thing, but every little bit adds up. It's one of the reasons Unix is so slow, compared to modern operating systems. We could write a zillion tiny binary files, but that wouldn't be the unix way. Just having a zillion tiny files is very inefficient of disk space (although with modern drive capacities that might be less of a problem), because files are generally allocated in large blocks. Searching for a resource by name is probably no cheaper than searching a directory for a file by name, but fetching a numbered resource can be much faster. Just the mechanics of opening and closing files is a substantial performance limit, but if you don't close them, you bump into limits on the number of open files.

Then there is the integrity problem. It is much harder for a casual or bumbling user to corrupt a monolithic binary file than to lose or corrupt one of a zillion tiny files. Files are copied one-by-one, and a transmission error could take out a file -- and warn the user, who might reasonably be tempted to let it go ("It's only one of 29,000 files") -- while a monolithic file lost in transmission will be corrected immediately because the program cannot run without it.

About the only thing going for the unix way is its initial simplicity. Unix generally has a low buy-in cost, but a much higher TCO (total cost of ownership) than the more modern (designed) solutions to the same problems.

The resource file implementation I chose uses a fixed-size block for allocation simplicity, with the assumption that tiny resources can be packed several to one block, while anything bigger than a few hundred bytes will get its own block. The design calls for resources larger than 4K bytes to be split across multiple blocks, but the code to make that happen is somewhat more complex than if I prohibit big resources. The existing tree for Php 3.12 is 5736 nodes, which at 16 bytes per node is about 90K bytes (23 4K blocks). Do I make big resources, or split them in BT? Considerations:

a. The system should not be putting arbitrary (4K) limits on the program;
b. Solving the blocking problem once in the system (twice, if you count the framework separately) is better than dealing with it over and over in the application programs;
c. T2 is not friendly about growing array blocks;
d. If blocked in the program, we can load only the parts we need (faster);
e. We can always add blocking to system later.

We might could add some kind of redim() function to T2 to deal with (c), but I don't know how at this time. Load/save time (d) is probably not a big deal, given ever faster computers. The deciding factor looks like (e), to implement a 4K max at this time in MOS/frame, and to use arrays of blocks in BT, which looks like the shortest time to market with the fewest irreversible decisions.
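
The "arrays of blocks" choice can be illustrated with a short sketch. It assumes a full 4096 bytes of each block are usable for 16-byte nodes (no per-block overhead); the names are hypothetical.

```python
from typing import List, Sequence, Tuple

BLOCK_BYTES = 4096  # maximum resource size chosen above
NODE_BYTES = 16     # bytes per tree node
NODES_PER_BLOCK = BLOCK_BYTES // NODE_BYTES  # 256 under these assumptions


def split_into_blocks(nodes: Sequence) -> List[Sequence]:
    """Cut a long node array into block-sized pieces, one resource each."""
    return [nodes[i:i + NODES_PER_BLOCK]
            for i in range(0, len(nodes), NODES_PER_BLOCK)]


def locate(node_index: int) -> Tuple[int, int]:
    """Map a node number to (block number, offset within that block)."""
    return divmod(node_index, NODES_PER_BLOCK)
```

By this arithmetic the 5736-node tree occupies 23 blocks, the last one holding 104 nodes, and fetching a node costs one `divmod` plus loading a single block, which is the partial-loading benefit noted in consideration (d).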

2005 March 21 -- When I gave up fully deploying this on my MOS operating system, I thought I was also giving up the ability to run full regression tests and (especially) unit tests, which I assumed needed scripting and a scriptable compiler. Scripting is an important (and working) part of MOS, but lesser systems don't have it [I subsequently learned that Windows does have what they call "Windows Scripting Host" but (Microsoft documentation being what it isn't) it's not yet clear what WSH can do]. However, since I still plan to do most of the development on the Mac (saving the PC for the final build and associated tests), I think I can still use the script automation features of the existing and partially functional MOS for regression tests. That doesn't test it on the PC platform, but it does test most of the code. Furthermore, I can actually put unit test code in the BibleTrans program, and just switch it off for release. This is already bearing fruit. Well, not much.

2005 March 18 -- Since MOS is still unable to support full software development (and the PC is just too hard to use for anything), I must continue using my Mac-based software development tools. Therefore I chose to implement the T2 -> C translator in HyperCard. It took one day to get it working and usable. HyperCard, although long ago killed by Apple, is still the development tool without peer!

2005 February 26 -- I gave up on making the operating system a formal part of BibleTrans. It's too big. Instead I broke it off as a separate project, eventually to become the tool system I work in, but not holding up BibleTrans. I will still write and test BibleTrans in Turk/2, but then deploy it through a Turk -> C translator. Cobbling together code from MOS and the PC-native version of IBSM, I had the framework up and running in about two weeks (see screenshot).

2004 August 12 -- I bought Qt, a commercial framework for implementing C++ programs cross-platform (but mostly on the PC). Subsequent experience suggests that despite the fact that Qt has better documentation and support than Microsoft's Dot-Net framework, compiling direct to the Microsoft system-native Win32 interface is consistently (even surprisingly) about twice as fast as working in Qt. Qt also has the same run-time versioning problems that afflict Java, whereas a program compiled direct to Win32 can be deployed as a monolithic program file which just runs on anything from Win95 on.

2004 April -- Realizing how hard the PC is to work with, I began work on a Mac-like operating system as a basis of future development, written in Turk/2. The system works, but the software development environment to deploy it fully is still a long way off.

2004 March 19 -- The PC is the only game in town, so I bought one. Blech. Just goes to show that market penetration is not a function nor indicator of quality.

2004 January 21 -- I was informed that the university is terminating my employment; although and because the administration tried to conceal it, I was able to deduce that their reason for termination makes me unemployable. I explained this to the provost, and he didn't want to admit it, but he could not deny the facts. I took this as a sign from God to start BibleTrans back up.

2003 December -- Elizabeth Miles completed revision of her encoding of Philippians and Luke 1-4. She has done excellent work. Thank you, Elizabeth! She is scheduled to leave in early 2004 for Switzerland to learn French before being assigned to work with SIL in Africa.

2003 March -- Designed a new programming language "Turkish Demitasse (a stronger brew than Java)," also known as Turk/2 or simply T2, to overcome some of the remaining serious shortcomings of C/C++ not repaired by Java. Java only runs in emulation and has the usual version modification problems of a run-time not built into the native operating system that comes with a computer; C/C++ compiles to native code, but the language is poorly suited to building robust software in reasonable time. Its known and many failings are largely responsible for system crashes and the computer virus epidemic which shows no signs of abating to this day. T2 is designed to not have those problems.

2003 January -- Apple officially kills the Mac. They no longer sell computers that run the MacOS except in emulation mode, which will probably eventually go away also. Their current product line runs a 35-year-old system called Unix, with a heavy layer of pancake makeup to hide the wrinkles. Apple has a shrinking 2% of the computer market.

2002 March 9 -- BibleTrans International self-destructed. I lack the leadership and management skills to assemble and motivate a team of people with a long-term commitment to computer-assisted Bible translation. I took this as a sign from God to go do something else for a while, in this case teach college. With the generous (but unfortunately now triple-taxed) help of Norm&Sally Larsen, I continued to keep Elizabeth Miles on my payroll, encoding Philippians and the first four chapters of Luke.

2002 January 17 -- Without any supporting evidence, and in flagrant denial of more than 2,000 years of unbroken historical experience, the Internal Revenue Service [Ref# 50-12804] determined that Bible translation into languages that do not yet have it is a profit-making business. Your tax dollars at work.

2001 October 19 -- BibleTrans formally presented to Bible translation community at "Bible Translation 2001" conference in Dallas. One of the students in attendance, Elizabeth Miles, signed up to encode database on her PC using Executor, the Mac emulation package.

2000 May 8 -- "BibleTrans International, a California nonprofit corporation, came into existence for the purpose of proclaiming the gospel of Jesus Christ by using computer technology to facilitate the translation of the Bible into languages where it does not yet exist."

1999 August 16 -- Met with Steve Beale, Tod Allman and other SIL technical people in Dallas to develop what we called the Allman-Beale-Pittman encoding for a Biblical ontology, which is still used in BibleTrans, and to commit to encoding the book of Philippians. Beale and Allman subsequently went on to do other things.

1999 June/July -- Adapted BibleTrans to work in ARDI's Executor, a Macintosh system emulator that runs on the PC. It's only a stop-gap, until I can get around to converting the whole program to native PC code. Until then I can maintain a single code base for cross-platform deployment.

1998 October -- Demonstrated actual translation into non-European language (Tuwali-Ifugao) at SIL/JAARS Computer Technical Conference. It was implemented on Macintosh in compiled HyperTalk.

1997 Summer -- I connected with Steve Beale, then doing doctoral work at New Mexico State University, who suggested using Louw&Nida semantic domain concept numbers as an ontology instead of inventing my own. It was a wonderful idea still being used in BibleTrans. Some time after the 1999 meeting in Dallas, Steve began to insist that "Louw&Nida is not an ontology," which may be strictly true, but it does cover 80% of the needs for the New Testament (the rest being handled adequately by the ABP extensions). Steve also made introductions for me to meet with some experienced translators in the Translation Department at SIL/Dallas later in 1997, to better understand their needs and concerns.

1997 April -- First work in designing BibleTrans proof of concept. A simplistic demo in HyperCard was working in May.

1987 -- I began to realize that the compiler technology developed in my PhD thesis could also be used for Bible translation. I visited with Gary Simons, then Director of Academic Computing at SIL/Dallas, to explore opportunities for collaboration, but he had other priorities.

Rev. 2015 March 4

`00`	`Nop`	error/no operation (can't happen)
`01,nn`	`Lino`	line number nn, for debugger
`02,nn`	`OpFr`	open frame for calling procedure nn
`03`	`CallFr`	call procedure whose frame `02` opened
`04`	`Stop`	stop
`05`	`CallLN`	call L&N proc on ToS
`06`	`AnoLst`	iterate all nodes in list
`07`	`EnoLst`	iterate all but last node in list
`08`	`OK`	procedure exit, ok
`09`	`Done`	translation completed successfully
`0A,xx`	`Jump`	branch +/-xx bytes
`0B,xx`	`BrF`	branch on false +/-xx bytes
`0C,nn`	`NuVar`	create new var nn of ToS
`0D,nn`	`Sto`	store ToS into var nn
`0E,nn`	`Ld`	push var nn onto stack
`0F,nn`	nn	push integer nn
`10`	`False` / `""`	push false (empty string)
`11,str`	"str"	push literal string 'str'
`12,str`	`Do`	execute string 'str' as OS script (not yet implemented)
`13,x`	`Tree`	deref ToS+x, which is tree offset
`14`	`Pack`	PackPt ToS,ToS-1, build 32-bit integer from two 16-bit parts
`15`	`Swap`	swap ToS
`16`	`Pop`	pop ToS
`17`	`Dupe`	dupe ToS
`18`	`LdAtr`	deref ToS, which is tree; replace it with tree attribute string
`19`	`StAtr`	deref ToS, which is tree; store ToS-1 as tree attribute string
`1A`	`Rot3`	rotate ToS below next 2 (not yet implemented)
`1B`	`Pgph`	new paragraph
`1C`	`Capz`	capitalize
`1D`	`NoWds`	no word break
`1E`	`Emit`	emit ToS
`1F`	`Gloss`	gloss from ToS
`20-27`	`+ - * /` `% & | ^`	integer arith/bit operators: `+ - * / % & | ^`
`28`	`Decz`	decimalize (not yet implemented)
`29`	`Catn`	catenate
`2A-2F`	`< > =` `>=` `<=` `!=`	compare: `< >= <= > = !=`
`30`	`Len`	length(s), s on ToS
`31`	`Offs`	offset(x,s) returns the offset of x in s, or -1 if not contained
`32`	`Subst`	substring(i,n,s) is char i through length n of s
`33`	`Replc`	replace(x,i,n,s) put x into char i through length n of s
`34`	`ItmNo`	itemno(x,s) returns the item number if x is an item in s
`35`	`DelItm`	delitem(n,s) returns s with item n deleted
`36`	`Item`	item(n,s) returns item n of s
`37`	`CntItm`	CountItems(s)
`38`	`SubTr`	GetSubTree, same as `Tree,3`
`39`	`NxtNo`	GetNextNode, same as `Tree,2`
`3A`	`PutItm`	putitem(x,n,s) puts x into item n of s, leave result on ToS
`3B`	`Nouns`	extract NounList from reference tree in ToS
`3C`	`LNinTr`	true if ToS-1 tree contains ToS L&N#; 0 tests if it's a tree
`3D`	`Bref`	extract ToS Bref, if any, as 3-item string of numbers B,C,V
`3E`	`NouNo`	extract ToS noun ref, if any, as integer
`3F`	`UpNo`	Get parent tree, same as `Tree,1`
`40`	`CkTbVrs`	(GrammWhiz) Set all CheckTable vars from ToS
`41`	`LookTab`	Replace ToS table ID with its lookup value
`42`	`DWIM`	"Do What I Mean" = `Emit` or `AnoLst` or `CallLN`
`43`	`GetLN`	Recover concept from tree node, as integer d*1000+c
`44`	`NxTrLs`	Extract Next item from (string) Tree node List
`45`	`TrLsApd`	Append Tree node onto List
`4D,xx`	`xSto`	Pop index off stack, add (index-1) to `xx` and store ToS-1 in that var
`4E,xx`	`xLd`	Pop index off stack, add (index-1) to `xx` and push that var
`4F,xx`	`xRng`	If ToS<1 or >`xx` replace it with 0, else dupe ToS
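
The stack-oriented opcodes above can be illustrated with a minimal interpreter sketch. This is not the BibleTrans virtual machine: it handles only `Stop`, push integer, `Swap`, `Pop`, `Dupe` and the `20-27` operators, and it rests on assumptions the table does not state. The operand `nn` is taken to be a single byte, codes `20` through `27` are taken to follow the listed order `+ - * / % & | ^`, binary operators are taken to compute (ToS-1 op ToS), and `/` is taken to be integer division. The names `run` and `ARITH` are hypothetical.

```python
import operator

# Opcode values come from the table; everything else here is assumed.
STOP, PUSH_INT, SWAP, POP, DUPE = 0x04, 0x0F, 0x15, 0x16, 0x17

# Assumed: 0x20-0x27 follow the order listed in the table: + - * / % & | ^
ARITH = {
    0x20: operator.add, 0x21: operator.sub, 0x22: operator.mul,
    0x23: operator.floordiv, 0x24: operator.mod,
    0x25: operator.and_, 0x26: operator.or_, 0x27: operator.xor,
}

def run(code):
    """Execute a byte sequence and return the final stack (bottom first)."""
    stack = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        pc += 1
        if op == STOP:
            break
        elif op == PUSH_INT:
            stack.append(code[pc])  # assumed: one-byte integer operand
            pc += 1
        elif op == SWAP:
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif op == POP:
            stack.pop()
        elif op == DUPE:
            stack.append(stack[-1])
        elif op in ARITH:
            right = stack.pop()  # ToS
            left = stack.pop()   # ToS-1
            stack.append(ARITH[op](left, right))
        else:
            raise ValueError(f"opcode {op:#04x} not handled in this sketch")
    return stack
```

Under those assumptions, `run(bytes([0x0F, 6, 0x0F, 7, 0x22, 0x04]))` pushes 6 and 7, multiplies them, and stops with `[42]` on the stack.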

Tag	Name	Content
-		frame
0	`Nul`	null
1	`Num`	integer
2	`Str`	string ptr
3	`Tre`	Tree ref
+		(see notes)

`<DocX ID=nn>..</DocX>`	Delimits the text body for one document page
`<Target=name/>`	Defines the name used for linking to this document page or reference
`<ref=name>..</ref>`	Links to another document, and defines the text to click on for linking
`<Icon=name/>`	Links to a named (or numbered) `Icon` image, displayed inline
`<Icon ID=nn name=name>` `..</Icon>`	Defines a named `Icon` image and delimits the pixel data for it
`<Strn ID=nn size=ww/>`	Links to a `Strn` resource string, with optional text width
`<title>..</title>`	Delimits the text used in the window title bar
`<Drop>..</Drop>`	Delimits the text displayed in "drop-cap" mode
`<sup>..</sup>`	Delimits the text displayed in smaller font as a superscript
`<Text>..</Text>`	Delimits the text displayed in the default font (normally omitted)
`<Face>..</Face>`	Delimits the text displayed in other defined fonts
`<p/>`	Start a new paragraph
`<Tab/>`	Indent, or leave some blank space within a paragraph
`<img height=hh width=ww` `align=aa>..</img>`	Defines an embedded image and delimits the data for it
`<Memo>..</Memo>`	Delimits some explanatory text that is not displayed
`<Node ID=ii Icon=nn` `col=cc>..</Node>`	* Defines a tree node element inside an image, and encloses the associated slot label items
`<Slot ID=ii>..</Slot>`	* Delimits the text of a tree node slot label, linked to another node
`<Link ID=ii/>`	* Defines a secondary link (as a vertical bar) for a multi-tree slot
`<LocVH=vv,hh/>`	Display next image element at coordinate [vv,hh]
`<LineTo=vv,hh/>`	Draw a line from current position to coordinate [vv,hh]
`<RectHW=vv,hh/>`	Draw a rectangle with its top/left corner at current position, with given height and width
`<Color=rr,gg,bb/>`	Set the current drawing color to given RGB values (0-5 each)
`<Ipix>..</Ipix>`	Delimits the numbers representing columns of pixels in the current color, used in drawing icons
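
Tags of the form `<Name=value/>` are not well-formed XML, so a standard XML parser would reject them. As a rough sketch of how the drawing tags might be picked out of a page instead, the following uses a regular expression. It assumes the coordinate and color values are written as decimal integers; the names `TAG` and `drawing_commands` are hypothetical and not part of BibleTrans.

```python
import re

# Matches the self-closing drawing tags from the table, e.g. <LocVH=10,20/>.
# Assumed: values are comma-separated decimal integers.
TAG = re.compile(r"<(LocVH|LineTo|RectHW|Color)=(\d+(?:,\d+)*)/>")

def drawing_commands(text):
    """Return (tag name, tuple of integer arguments) for each drawing tag."""
    return [
        (m.group(1), tuple(int(v) for v in m.group(2).split(",")))
        for m in TAG.finditer(text)
    ]
```

For example, `drawing_commands("<Color=5,0,0/><LocVH=10,20/><LineTo=10,80/>")` yields a `Color` command followed by a `LocVH` and a `LineTo`, each with its integer arguments.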

Mood	Indic,Subj,Opt,Imper,Inf,Part	JXOECZ
Tense	Pres,Imperf,Fut,Aor,Perf,Pluperf	WIUBRL
Voice	Active,Middle,Passive,Mid/Pass	QKTH
Person	1,2,3	123
Number	Singular,Plural	SP
Gender	Masc,Fem,Neuter	MFN
Case	Nom,Gen,Dat,Acc,Voc	YGDAV

`GkwdLemm`	`Gkwd:` index to inflected Greek word `GrkX` `Lemm:` index to lexical form, also in `GrkX`
`ParsGlos`	`Pars:` index to parse code `Pars` `Glos:` index to set of glosses `GloX`
`0cPwLwIw`	`0c:` additional pixel width if caps `Pw:` parse code byte length `Lw:` lemma pixel width `Iw:` inflected word pixel width
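
The three layout rows above can be read as a fixed-size record. The sketch below assumes that each character in a layout name stands for one hex digit (4 bits), so the three rows together occupy 12 bytes, and that the values are stored big-endian. Neither assumption is confirmed by the table, and `unpack_word_record` is a hypothetical name.

```python
import struct

def unpack_word_record(data):
    """Split a 12-byte record into the fields named in the layout rows.

    Assumed layout: four 16-bit indexes (Gkwd, Lemm, Pars, Glos) followed
    by four 8-bit values (0c, Pw, Lw, Iw), all big-endian.
    """
    gkwd, lemm, pars, glos, caps_w, parse_len, lemma_w, infl_w = struct.unpack(
        ">HHHHBBBB", data
    )
    return {
        "Gkwd": gkwd,     # index to inflected Greek word
        "Lemm": lemm,     # index to lexical form
        "Pars": pars,     # index to parse code
        "Glos": glos,     # index to set of glosses
        "0c": caps_w,     # additional pixel width if caps
        "Pw": parse_len,  # parse code byte length
        "Lw": lemma_w,    # lemma pixel width
        "Iw": infl_w,     # inflected word pixel width
    }
```

If the actual file uses a different byte order or field width, only the format string passed to `struct.unpack` would need to change.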

`4pgaCCCC`	`p:` 2-bit punctuation `g:` 6-bit gloss sub-index `a:` 1-bit capitalization `C:` 16-bit cell index, see also `ILGW`
`nneeLLLL`	`n:` width of L&N tag `e:` width of gloss `L:` L&N tag (see `L&NS`)

0,h,w		Filled rectangle height h, width w
1,n,s,c		Text font style s, color c, for n characters following
2,v,h		Jump to coordinate vertical v, horizontal h
3,v,h		Draw line to coordinate vertical v, horizontal h
4,i		Icon, id i, usually a 4-character name
5		Icon
6,n,c		Pixel bits, color c, for n column words following
7,h,w		Pixel bytes following, height h, width w

0,ff,w	Text word w, with punctuation & inflection bits ff; 00 is end of doc
1,f,c	Single (or double) character c, with punctuation bits f; 1,0 is paragraph end
2,f,n	Decimal number n, with punctuation bits f
3,f,b,c,v	Bible reference (book,chapter,verse), with punctuation bits f
4,f,g,n	Interlinear text cell (`ILGW`) n and gloss increment g, with punctuation bits f (2 words)
5,f,s,n	`Strn` resource n, with punctuation bits f, and pixel width s; s=0 is whole line
6,t,s	Hot-link reference number t, with color & style bits s; t=0 is font change
7,r,i	`Icon` number i, size r
8+	Picture encoding
9+	Active Image