lili: the little literate programming tool

lili is a simple command line program written in portable dependency-free C that aims to provide the absolute bare minimum of functionality needed to enable writing programs in a literate programming style. All lili does is scan a plain-text file for a handful of control sequences, extract source code delimited by those control sequences, and output source code files by recursively expanding the chunks of code thus found.

This minimalist approach ensures users of lili can depend on it to continue working reliably and easily modify the source code to tailor the program to their specific needs, such as adding support for a particular typesetting markup or programming language.

what is literate programming?

Literate programming is a very simple idea: Write a book. And oh, by the way, the book has the actual executable source code in it.

Timothy Daly, 2013. "Literate Programming in the Large"

Literate programming (first described by Donald Knuth in his paper introducing the idea) is a style of computer programming that prioritizes the relationship between the program writer and program reader. The essence of this approach is the position that computer programs should be written primarily as a form of communication between humans and only incidentally as a means of instructing computing machines to perform useful jobs.

Few would disagree that excellent computer code, in addition to whatever other positive qualities it may possess, should be easy for any programmer to read and understand. Conventionally, this ideal is approached by carefully choosing variable and class names, breaking up complex subroutines into well-named functions, writing unit tests that demonstrate how a given subroutine is used, and so on; basically, the code itself is written to be self-describing. After all, the source code has the final word on what the program does and how it does it. Comments, documentation, and any other things besides are all irreconcilably seperate from the actual logic of the program that the code describes, and they even run the risk of misleading the reader of a program. As Robert Martin wrote, "the proper use of comments is to compensate for our failure to express ourself in code."

On the other hand, there is much that the code itself may struggle to communicate or which may even be impossible for the code to communicate. Programming languages are constrained by the systems they are meant to control, by the systems that translate from source code to machine code, and so on. These constraints in turn influence the writers of the source code, the ideas that they can express through the language, and the ways in which they can present those ideas. Source code excels at expressing what the program does and how it does it. Code often struggles when it comes to communicating why it does what it does and why it does it in that particular way rather than taking some other approach. Almost every programming language provides some mechanism for embedding natural language comments in the source code, precisely to help programmers to express that which the code cannot. Documentation is also increasingly integrated in programming language tooling. But both comments and documentation remain subordinate to the source code. They serve to help the reader decode the source code. The source code itself remains supreme, and often also remains opaque, esoteric, and difficult to understand.

Literate programming proposes to invert this relationship: rather than write a program with some explanations embedded in it, you craft a piece of writing with a program embedded in it. Rather than write comments and documentation interspersed in source code, write source code interspersed in a natural language narrative. Rather than write a program primarily to control the behavior of a computer and only incidentally to be read by humans, write a piece of prose primarily to communicate with humans and only incidentally to be executed by a computer. This is the theory of literate programming.

In order to write a literate program, the expressive power of most programming languages is insufficient. For this reason, the practice of literate programming usually involves the use of specialized tooling that allows for expressing source code embedded in a piece of literature. These tools allow the source code to be ordered and explained using the structure that most naturally expresses the meaning of the program, rather than the order and structure that is most convenient to the compiler or interpreter. This involves, for instance, allowing code chunks to be expressed in any order.

should I write literate programs?

Proponents of literate programming argue that the style provides numerous benefits. The prioritization of communication over execution allows programmers unfamiliar with a program to understand and take ownership of it, allowing new team members in a collaborative environment to become productive contributors to the project more quickly. Literate programs are also purportedly more resilient to the passage of time and changing of hands, as new maintainers of a program are endowed with the knowledge of not only what and how a program works but also why and what for. Furthermore, writing the natural language exposition of a literate program naturally forces the writer to not only better comprehend the program they are writing but also to write a program that is fundamentally more comprehensible. As Donald Knuth writes, "my programs are not only explained better than ever before; they also are better programs, because the new methodology encourages me to do a better job." Literate programming also seems especially well suited to environments in which understanding the program is as important as the program itself, such as in academic research where a program is part of research findings.

However, despite the purported benefits, literate programming has not seen much mainstream adoption. Although there have been numerous remarkable success stories (notably one book written in a literate programming style that won an Academy Award), most programmers do not program in a literate style. This may be explained in part by the disadvantages of literate programming.

As Knuth recognized in the paper introducing the style, literate programming practice usually entails writing literate programs in a document that combines typesetting markup, literate programming specific markup, and source code. This introduces the opportunity for syntax errors in three different languages in addition to normal program logic errors, with the associated increase in the complexity of debugging a literate program. More generally, literate programming practice introduces additional tooling that necessarily adds some complexity to writing programs.

Programming in a literate style requires authoring and maintaining not only the program itself but also the literary work in which the program is embedded. As the program changes, it is essential to ensure that the associated literature is updated with discipline lest it drift out of sync with the actual program logic it is meant to describe. The cost of maintaining the prose in addition to the program can be magnified in environments where the requirements for the program and/or its design may frequently and drastically change.

However, the most damning challenge to the adoption of literate programming comes not from the tooling nor the maintenance of prose, but simply from the fact that programming is hard, writing is hard, and very few people are good at both. This means that programmers with aptitude for literate programming are naturally somewhat rare.

In summary:

Pros:

Improved readability decreases time for new programmers to integrate into collaborative environments
Program maintenance is easier especially as old maintainers become separated from the project
Authoring prose alongside code encourages writing higher quality programs
Emphasis on communication is a natural fit for e.g. programs produced as part of research

Cons:

Additional tooling imposes increased debugging and programming environment complexity
Prose introduces additional maintenance overhead, especially when requirements change quickly
Writing good programs is hard enough without also having to write good literature

If program requirements are reasonably well defined, if it is common for newcomers to contribute to a project, if code maintainers are likely to change over time, if the people involved in a project are reasonably good writers as well as programmers, and especially if the full and varied meanings of the source code are as important as the code itself, it may be worthwhile to consider adopting a literate programming style.

Writing in a literate programming style may seem less useful for beginners, hobbyists, and other amateurs, who may not have the well define program requirements and deep understanding of the programs they are writing, and who generally work alone and so miss out on the collaborative advantages of literate programming. However, the same people are very likely to benefit from more widespread adoption of literate programming, because reading and understanding a literate program is much easier than doing the same with programs written in a more conventional style. This is the advantage of literate programming most exciting to me personally. Literate programs have great potential as educational tools. In the best case, literate programs can elevate and empower the reader in a way conventional programs simply can't.

how does `lili` work?

The remainder of this document (the literate source code for lili) describes the functionality and implementation of lili.

lili knows nothing about the machine source code language, has no runtime configuration or flags, and makes as few assumptions as possible about the typesetting markup used for prose passages. Furthermore, lili provides no functionality for weaving, i.e. producing typesetting markup from the literate source code. This functionality is not essential for literate programming, which is about organization and communication rather than typesetting and generating cross references (see this article for example). If the writer wishes to produce a typeset document from the source code, typesetting markup may be written directly in the lili-formatted document and the typesetting system can be run directly on this document. For example, this literate source document for lili is typeset in Github-flavored Markdown. Alternatively, the user may edit the program to introduce support for weaving.

Despite its intentional limitations, lili provides enough functionality to enable writing in a literate programming style. Code chunks can be presented to the reader in any order, and code chunks can be referred to by name inside of other code chunks, allowing for high levels of abstraction in the presentation of the program. When outputting source code, chunks are recursively expanded into the output source files while retaining indentation for compatibility with white space sensitive languages such as python.

The help text explains the control sequences lili searches for to extract the machine code embedded in your natural-language document:

// ~='help text'
const char * help =
"lili: the little literate programming tool -- version %s\n\
\n\
    USAGE: %s file\n\
\n\
    lili extracts machine source code from literate source code.\n\
\n\
All control sequences begin with a special character called ATSIGN, which is \n\
set to '@' by default. Except for escaped ATSIGNs, all control sequences consume\n\
(i.e. cause to be ignored) the remainder of the line following the end of the\n\
control sequence.\n\
\n\
ATSIGN:new control character  Redefines ATSIGN to whatever immediately follows\n\
                              the : sign. This is useful if your machine source\n\
                              language uses lots of @ signs, for example. This\n\
                              control sequence is ignored if it occurs inside\n\
                              a code chunk definition.\n\
\n\
ATSIGN='chunk name'           Begin a regular chunk definition. The chunk name\n\
                              must be surrounded by matching single (') or\n\
                              double (\") quotes (no backticks). The chunk\n\
                              definition itself begins on the next line.\n\
                              Anything preceeding the ATSIGN and following the\n\
                              final quote sign is ignored, allowing this control\n\
                              sequence to be wrapped in typesetting markup.\n\
\n\
ATSIGN#'chunk name'           Begin a tangle chunk definition. This is similar\n\
                              to a regular chunk, except the name of the chunk\n\
                              is also interpreted as a file name, and the chunk\n\
                              is recursively expanded into the file with that\n\
                              name, overwriting any existing file.\n\
\n\
ATSIGN+'chunk name'           Append to a chunk. The code starting on the\n\
                              next line will be added at the end of the chunk\n\
                              named by 'chunk name'. This is useful e.g. for\n\
                              adding #include directives to the top of a c file.\n\
                              If no such chunk already exists, then this control\n\
                              sequence is equivalent to a normal chunk\n\
                              definition (ATSIGN='name').\n\
\n\
ATSIGN{chunk invocation}      Invoke a chunk to be recursively expanded into any\n\
                              tangled output files. Anything preceeding the\n\
                              ATSIGN is considered as indentation, and every\n\
                              line of code in the recursively expanded chunk\n\
                              will have this indent prepended to it. Anything\n\
                              following the closing brace is ignored, however it\n\
                              is recommended not to put anything there for\n\
                              compatibility with possible future extensions that\n\
                              may make use of this space (i.e. text following\n\
                              the closing brace is reserved).\n\
\n\
ATSIGNATSIGN                  Escape sequence. The rest of the line following\n\
                              the initial ATSIGN (including the second ATSIGN)\n\
                              is treated as normal code to copy to output\n\
                              tangled documents. This allows ATSIGN and special\n\
                              control sequences to be passed through to the\n\
                              output documents.\n\
\n\
ATSIGN/                       End an ongoing chunk declaration.\n\
\n\
ATSIGN[anything else]         An ATSIGN followed by any other character is\n\
                              ignored. However, it is recommended to avoid such\n\
                              character sequences for compatibility with future\n\
                              extensions.\n\
\n";
// end help text ~/

program overview

lili is implemented with three main processing stages: setup, extraction, and output. The setup phase reads the file to be processed into memory. During extraction, the file is scanned for code chunk definitions, and these are logged in data structures convenient for output. The output phase then recursively expands code chunks into files.

// ~#'lili.c'
~{definitions}

void lili(char * file, dict * d, list ** tangles)
{
    char * source;

    ~{load file into `char * source`}

    ~{extract code chunks}
}

int main(int argc, char ** argv)
{
    dict * d;
    list * tangles = NULL;
    char * file;

    ~{setup}

    lili(file, d, &tangles);

    ~{output tangle chunks recursively}

    return 0;
}
// end lili.c ~/

code chunk extraction

code chunk representation

A chunk of code is represented by the following structure:

// ~='code chunk struct'
typedef struct CodeChunk
{
    char * name;
    list * contents;
    int invocations;
    int tangle;
} code_chunk;
// ~/

The invocations member is used when tangling output to enforce that each chunk may only be used once (since lili is not meant to act as a text preprocessor and should not support "function call by copy-and-paste" usage idioms), and to allow a warning to be raised if a chunk is defined but never used.

The list of contents is populated by entries in the form of the following structure:

// ~+'code chunk struct'
typedef struct ChunkContents
{
    char * string;
    code_chunk * reference;
    int partial_line;
} chunk_contents;
// ~/

There are two different types of chunk contents, which can be differentiated on the basis of how the fields of the chunk_contents struct are populated. The most common kind of contents is a single line of code. In this case the string field represents the line of code, and the reference field is left empty (i.e. points to NULL). The other kind of contents is a chunk invocation. In this case the string field represents the indentation preceeding the invocation, while the reference field points to the chunk referred to by the invocation. An enum type and a function are provided which formalise the distinction, and subroutines are provided for constructing contents of one or the other type, as well as for constructing a code chunk.

// ~+'code chunk struct'
typedef enum ContentType {code, reference} content_t;

content_t contents_type(chunk_contents * c)
{
    if (c->reference != NULL) return reference;
    else return code;
}

chunk_contents * code_contents_new(char * code)
{
    chunk_contents * c = malloc(sizeof(chunk_contents));
    c->string = code;
    c->reference = NULL;
    c->partial_line = 0;
    return c;
}

chunk_contents * reference_contents_new(char * indent, code_chunk * ref)
{
    chunk_contents * c = malloc(sizeof(chunk_contents));
    c->string = indent;
    c->reference = ref;
    c->partial_line = 0;
    return c;
}

code_chunk * code_chunk_new(char * name)
{
    code_chunk * chunk = malloc(sizeof(code_chunk));
    chunk->name     = name;
    chunk->contents = NULL;
    chunk->invocations = 0;
    chunk->tangle = 0;
    return chunk;
}
// ~/

scanning for code

The outer-most loop of the extraction phase ignores your literate prose while scanning for control sequences and newline characters. Anytime a newline character is encountered the variable line_number is incremented. line_number is currently only used when printing error messages to help the user find the location of the error. When an ATSIGN is encountered, control flow switches depending on the character following the ATSIGN. Redefining ATSIGN, recursing with a referenced file, and printing a warning when encountering an unknown control sequence, these are simple cases and copied below. Extracting chunk definitions is more involved, so the description follows.

// ~+'globals'
char ATSIGN = '@';
int line_number = 1;
// ~/

// ~='extract code chunks'
{
    char * s = source;
    while (*s != '\0')
    {
        if (*s == '\n') ++s, ++line_number;
        else if (*s++ == ATSIGN)
        {
            switch (*s)
            {
            case '#':
            case '=':
            case '+':
                ~{extract chunk definition}
                break;
            case ':':
                ++s;
                exit_fail_if ( (  *s == '=' || *s == '#' || *s == '+'
                               || *s == '{' || *s == ':' || *s == '/'
                               || *s == '\n'
                               )
                             , "Error: Cannot redefine ATSIGN to a character "
                               "used in control sequences on line %d\n"
                             , line_number
                             );
                ATSIGN = *s++;
                break;
            default:
                exit_fail_if(1
                            , "Error: Unrecognized control sequence ATSIGN%c "
                              "while scanning prose on line %d\n"
                            , *s, line_number
                            );
            }
        }
    }
}
// ~/

parsing and extracting a chunk

When a well formed chunk definition is encountered, a new chunk is created or an existing one is retrieved for appending to, and an inner loop parses the chunk definition.

// ~='extract chunk definition'
if (*(s + 1) != '\'' && *(s + 1) != '\"') 
{
    exit_fail_if(1
                , "Error: Chunk definition sequence on line %d is missing a "
                  "quote-delimited name. Ignoring\n"
                , line_number
                );
    break;
}
else
{
    code_chunk * chunk;
    ~{prepare chunk}
    ~{parse chunk}
}
// ~/

When a chunk definition is found, s points to the character following the ATSIGN, which determines what type of definition to extract (1). The next character is either a single or double quote, and s is advanced to this character (2.a) before being passed to extract_name to fulfill the invariant of that function (that s should point to the delimeter surrounding the name when it is passed into the function) (2.b).

Once the name is extracted, any existing chunk by that name is retrieved from the chunk dictionary (3). If no chunk by the given name already exists, then a new one is allocated (4). This is the expected behaviour of regular and tangle definitions, and append definitions if the named chunk hasn't been defined already.

If the named chunk already exists and the definition is a regular or tangle chunk (5), then the contents list of the chunk has to be examined. If it is not empty, this indicates that the chunk is being redefined, which is considered an error (6). If the contents list is empty, then the chunk must have been allocated in response to its invocation in a chunk defined before the present definition. In this case, or in case the existing chunk was explicitly retrieved for appending to, nothing else needs to be done. If requested, the selected chunk is marked (7.a) and added to the list of chunks to tangle into output machine code (7.b), and then s is advanced to point to the start of the next line (8). Chunk parsing and extraction can now proceed to populate the contents list of chunk.

// ~='prepare chunk'
{
    /* (1) */
    int tangle = *s == '#';
    int append = *s == '+';

    ++s; /* (2.a) */
    char * name = extract_name(&s); /* (2.b) */
    chunk = dict_get(d, name); /* (3) */

    if (chunk == NULL) /* (4) new chunk definition */
    {
        chunk = code_chunk_new(name);
        dict_add(d, chunk);
    }
    else if (!append) /* (5) */
    {
        exit_fail_if(chunk->contents != NULL /* (6) */
                    , "Error: Redefinition of chunk '%s' on line %d.\n"
                      "       Maybe you meant to use a + chunk or accidentally "
                      "used the same name twice?\n"
                    , name, line_number
                    );
        /* todo: free existing chunk? */
    }
    if (tangle)
    {
        chunk->tangle = 1; /* (7.a) */
        list_push(tangles, (void *)chunk); /* (7.b) */
    }
}

exit_fail_if(!advance_to_next_line(&s) /* (8) */
            , "Error: File ended before beginning of definition of chunk '%s' "
              "on line '%d'\n"
            , chunk->name, line_number
            );
// ~/

Once the chunk is prepared and pointing to the chunk to edit, parsing takes place. s points to the beginning of the first line of the chunk on entry to the parse loop (1). The loop scans until it encounters a newline (2.a) or a control sequence (2.b). In these cases, a full line can be extracted, and the pointer to the start_of_line can be updated (3.a, 3.b). An exception is when an unrecognized control sequence is encountered (3.c), in which case the loop continues processing the line from the character after the false-alarm ATSIGN (3.d). The loop may also scan to the end of the file if there is an error, such as if the author forgets to put a termination sequence at the end of the last chunk definition in the file (4).

// ~='parse chunk'
{
    char * start_of_line = s; /* (1) */
    for (;;)
    {
        if (*s == '\n') /* (2.a) */
        {
            ~{extract code line}
            start_of_line = s; /* (3.a) */
        }
        else if (*s == ATSIGN) /* (2.b) */
        {
            ++s;
            if (*s == '/')
            {
                ~{end chunk extraction}
            }
            else if (*s == '{')
            {
                ~{extract reference line}
            }
            else if (*s == ATSIGN)
            {
                ~{extract line with escape sequence}
            }
            else /* (3.c) */
            {
                exit_fail_if(1, "Error: Unrecognized control sequence ATSIGN%c "
                                "while parsing chunk on line %d\n"
                            , *s, line_number
                            );
                continue; /* (3.d) */
            }
            start_of_line = s; /* (3.b) */
        }
        else /* (4) */ 
        {
            exit_fail_if(*s == '\0'
                        , "Error: File ended during definition of chunk %s"
                        , chunk->name
                        );
            ++s;
        }
    }
}
// ~/

A line of code is extracted when the parse scans a whole line without encountering any control sequences. The pointer to the start of the line is copied to a new contents list entry (1), and the entry is added to the list (2). The line number is incremented manually (3).

The newline character at the end of the line is then changed in place to a null character (4), effectively terminating the string pointed to by the contents list entry; this strategy is used throughout the program, for extracting code, indents, and chunk names, and saves from having to allocate more memory to copy strings that are already represented in the memory region pointed to by s.

// ~='extract code line'
chunk_contents * full_line = code_contents_new(start_of_line); /* (1) */
list_push_back(&chunk->contents, (void *)full_line); /* (2) */
++line_number; /* (3) */
*s++ = '\0'; /* (4) */
// ~/

When the control sequence for the end of the chunk definition is encountered, s is simply advanced to the next line and the extraction loop is escaped with a break statement. No further action is necessary.

// ~='end chunk extraction'
advance_to_next_line(&s);
break;
// ~/

When a chunk invocation sequence is encountered, three things need to happen:

(1) the indent string needs to be extracted, 
(2) a pointer to the chunk referred to by the invocation needs to be retrieved, 
(3) and both of these need to be added to the contents list.

The start_of_line pointer already marks the beginning of the indent (1.a). The ATSIGN marks the end of the indent, so it is converted to a null character (1.b, 1.c), terminating the indent string in place.

To get a pointer to the chunk referred to by the invocation, the name between the braces of the invocation (2.a) is looked up in the chunk dictionary (2.b). If the chunk doesn't already exist, a new chunk is constructed using the name from the invocation (2.c). This chunk can be defined later, at which point it will be recognized as having been invoked before its definition by the fact that its contents list is empty.

With a pointer to the chunk and a pointer to the null terminated indent ready, the new contents entry can be constructed and appended to the contents list (3). The rest of the line is ignored (4).

// ~='extract reference line'
code_chunk * ref;
char * indent = start_of_line; /* (1.a) */

{
    char * end_of_indent = s - 1; /* (1.b) */
    *end_of_indent = '\0'; /* (1.c) */
}

{
    char * name = extract_name(&s); /* (2.a) */
    ref = dict_get(d, name); /* (2.b) */
    if (ref == NULL) /* chunk hasn't been defined yet */
    {
        ref = code_chunk_new(name); /* (2.c) */
        dict_add(d, ref);
    }
    exit_fail_if(!advance_to_next_line(&s) /* (4) */
                , "Error: File ended during definition of chunk '%s'\n"
                  "       following invocation of chunk '%s' on line '%d'\n"
                , chunk->name, name, line_number
                );
}

list_push_back(&chunk->contents, (void *)reference_contents_new(indent, ref)); /* (3) */
// ~/

A line with an escape sequence is very similar to a regular line of code, except with an extra ATSIGN in the middle that shouldn't be tangled to outputs. This is represented by adding two entries to the contents list, the beginning part spans from the beginning of the line up to (but not including) the first ATSIGN. The ending part spans from the second ATSIGN to the end of the line, including the second ATSIGN in its range. These are terminated in place at the end of the subroutine after s has been advanced to the end of the line (1).

The beginning part is marked as a partial line. This flag is picked up by the tangling subroutine to suppress the newline that is normally printed after code entries (2).

// ~='extract line with escape sequence'
chunk_contents * beginning_part = code_contents_new(start_of_line);
chunk_contents * ending_part = code_contents_new(s);
char * at_the_atsign = s - 1;

exit_fail_if(!advance_to_next_line(&s)
        , "Error: File ended during definition of chunk '%s'\n"
          "       following the escape sequence on line '%d'\n"
        , chunk->name, line_number);

/* (1) */
*at_the_atsign = '\0'; /* terminate beginning part */
*(s - 1) = '\0'; /* terminate end part; s - 1 points to a newline character */

beginning_part->partial_line = 1; /* (2) */

list_push_back(&chunk->contents, (void *)beginning_part);
list_push_back(&chunk->contents, (void *)ending_part);
// ~/

code tangling and output

Each code chunk recorded in the list of chunks to tangle is output to a file named the same as the code chunk. The top level loop iterates over all the tangle chunks (1), opens their file (2), and calls into the recursive print function (3):

// ~='output tangle chunks recursively'
for(; tangles != NULL; tangles = tangles->successor) /* (1) */
{
    FILE * f;
    code_chunk * c = tangles->data;

    ~{prevent invocation of tangle chunks}

    f = fopen(c->name, "w"); /* (2) */
    if (f == NULL)
    {
        exit_fail_if(1, "Error: Failed to open file '%s', skipping tangle\n"
                    , c->name
                    );
        continue;
    }
    code_chunk_print(f, d, c, NULL, 1); /* (3) */
    fclose(f);
}
// ~/

The representation of code chunks is designed mainly for the convenience of this function. Every contents entry is either a line of code or a reference to another chunk. If a chunk contains code (1), it is printed. Otherwise, the function recurses.

// ~='code chunk print'
void code_chunk_print(FILE * f, dict * d, code_chunk * c, list * indents, int tangle)
{
    list * l;

    ~{prevent multiple invocations}

    for (l = c->contents; l != NULL; l = l->successor)
    {
        chunk_contents * contents = l->data;
        if (contents_type(contents) == code)
        {
            ~{print code}
        }
        else if (contents_type(contents) == reference)
        {
            ~{recurse}
        }
    }
}
// ~/

Chunk invocation contents entries keep track of indentation in their struct. In order to ensure that nested invocations end up with nested indentation, all of the indents have to be kept track of. A linked list of indents is used for this purpose. As the print function recurses deeper, indents are appended to the list. As the layers peel back again, the indents are removed from the list.

// ~='recurse'
code_chunk * next_c = contents->reference;
list_push_back(&indents, (void *)contents->string);
code_chunk_print(f, d, next_c, indents, 0);
list_pop_back(&indents);
// ~/

When it comes time to print a line of code, the list of indents is expanded and the contents of the line are printed, except when the line is empty (1). When a partial line is encountered, its successor is immediately printed; this is necessary to avoid indentation being printed after escape sequences (2). Finally, a newline terminates the line of code (3).

// ~='print code'
if (*contents->string != '\0') /* (1) */
{
    /* print indents on non-empty lines */
    list * i;
    for (i = indents; i != NULL; i = i->successor)
    {
        fputs((char *)i->data, f);
    }
    fputs(contents->string, f); 
}


if (contents->partial_line) /* (2) TODO should this be while? */
{
    l = l->successor;
    exit_fail_if(l == NULL
                , "Error: Partial line without successor in chunk '%s':\n"
                  "       %s"
                , c->name, contents->string
                );
    contents = l->data;
    fputs(contents->string, f);
}

fputc('\n', f); /* (3) */
// ~/

Multiple invocations of any given chunk are not permitted by lili, which is not a text preprocessor and does not wish to encourage users to duplicate code by making it easy to do so. Disallowing multiple invocations also simplifies the prospect of untangling machine source to synchronize changes back to literate source, which may be implemented in the future.

// ~='prevent multiple invocations'
if (c->invocations != 0)
{
    if (c->invocations == 1)
        exit_fail_if(1, "Error: Ignoring multiple invocations of chunk %s.\n"
                    , c->name
                    );
    c->invocations += 1;
    return;
}
else if (c->tangle)
{
    if (tangle) c->invocations = 1;
    else
    {
        exit_fail_if(1
                    , "Error: Ignoring invocation of tangle chunk %s within "
                      "another chunk.\n"
                    , c->name
                    );
        return;
    }
}
else c->invocations = 1;
// ~/

extra details

Read on if you are interested in further details, such as the definition of the list and dict datatypes, the helper functions, and other minutiae.

list type

This is a very simple implementation of a linked list. The user is responsible for ensuring the list entries are appropriately converted to and from void pointers, and for managing the memory and lifetime of the data in the entries, but the list nodes themselves are allocated and freed appropriately automatically.

// ~='data types'
typedef struct List
{
    void * data;
    struct List * successor;
} list;

/* a list must be initialized with data */
list * list_new(void * d)
{
    list * l = malloc(sizeof(list));
    l->data = d;
    l->successor = NULL;
    return l;
}

/* we need a double pointer so that if we are passed lst == NULL we mutate *lst
 * so that it points to a new list. This happens in dict_add if the new chunk
 * is hashed into a bucket not already containing a pointer to a list */
void list_push_back(list ** lst, void * elem)
{
    list * a = list_new(elem);
    if (*lst == NULL)
    {
        *lst = a;
    }
    else
    {
        list * l = *lst;
        while (l->successor != NULL) l = l->successor; /* go to the end of l */
        l->successor = a;
    }
}

void list_pop_back(list ** lst)
{
    if (*lst == NULL) return;
    if ((*lst)->successor == NULL)
    {
        free((void *)(*lst));
        *lst = NULL;
        return;
    }
    else
    {
        list * l1 = *lst;
        list * l2 = (*lst)->successor;
        while (l2->successor != NULL) 
        {
            l1 = l2;
            l2 = l2->successor;
        }
        free((void *)l2);
        l1->successor = NULL;
        return;
    }
}

void list_push(list ** lst, void * elem)
{
    list * p = list_new(elem);
    if (*lst == NULL)
    {
        *lst = p;
    }
    else
    {
        list * l = *lst;
        *lst = p;
        p->successor = l;
    }
}

void list_pop(list ** lst)
{
    list * p = *lst;
    if (p == NULL) return;
    list * l = p->successor;
    *lst = l;
    free((void *)p);
}
// ~/

dict type

This is a very simple hash map dictionary. It only holds code chunks, but code chunks are technically just named lists, so really it could hold anything. The above linked list implementation is used to hold entries, so the memory etc. management features are basically the same as for lists (i.e. basically none), except there's currently no way to remove entries from the dict.

// ~+'data types'
~{code chunk struct}

/* http://www.cse.yorku.ca/~~oz/hash.html */
unsigned long hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *str++)) hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}

typedef struct Dict
{
    list ** array;
    size_t size;
} dict;

dict * dict_new(size_t size)
{
    dict * d = malloc(sizeof(dict));
    d->array = malloc(size * sizeof(list));
    memset(d->array, (size_t)NULL, size * sizeof(list));
    d->size = size;
    return d;
}

void dict_add(dict * d, code_chunk * c)
{
    unsigned long h = hash((unsigned char *)c->name) % d->size;
    list_push_back(&(d->array[h]), (void *) c);
}

code_chunk * dict_get(dict * d, char * name)
{
    unsigned long h = hash((unsigned char *)name) % d->size;
    list * l = d->array[h];
    while (l != NULL)
    {
        code_chunk * c = l->data;
        if (strcmp(name, c->name) == 0) return c;
        else l = l->successor;
    }
    return NULL;
}
~/

helper functions

This one is a wrapper around printf for killing the program if an unrecoverable error is encountered.

// ~='functions'
void exit_fail_if(int condition, char * message, ...)
{
    if (!condition) return;
    va_list args;
    va_start(args, message);
    vfprintf(stderr, message, args);
    va_end(args);
    exit(EXIT_FAILURE);
}
// ~/

This one null terminates in place a name delimited by matching identical characters (e.g. 'name', "name", .name., $name$) or braces (e.g. {name}) returns the name, and as a side effect advances the char * to the character after the terminal name delimiter. The pointer argument must point to the first delimiter, and the function fails if the second delimiter can't be found on the line, or if the name is empty.

// ~+'functions'
char * extract_name(char ** source)
{
    char * s = *source;
    char terminus = *s++;
    char * destination = s;

    switch (terminus)
    {
        case '{': terminus = '}'; break;
        case '[': terminus = ']'; break;
        case '(': terminus = ')'; break;
        case '<': terminus = '>'; break;
    }

    for (; *s != terminus; ++s)
        exit_fail_if ( (*s == '\n' || *s == '\0')
                     , "Error: Unterminated name on line %d\n"
                     , line_number
                     );
    exit_fail_if ( destination == s
                 , "Error: Empty name on line %d\n"
                 , line_number
                 );

    *s = '\0';
    *source = s + 1;
    return destination;
}
// ~/

This helper scans for the end of the line, incrementing line_number if successful, and returning 0 if unsuccessful. Failure, in this case, means that the end of the file was encountered before the end of the line, and the fail state is returned instead of just dying so the caller can provide more detail in the error message.

// ~+'functions'
int advance_to_next_line(char ** source)
{
    char * s = *source;
    while (*s != '\n') if (*s++ == '\0') return 0;
    *source = s + 1;
    ++line_number;
    return 1;
}
// ~/

Code chunk print is described above, and appended here in the program.

// ~+'functions'
~{code chunk print}
// ~/

setup routine

// ~='setup'
~{parse command line arguments}

~{allocate dict memory}
// ~/

Presently, lili takes no flags and accepts only one file to tangle at a time.

// ~='parse command line arguments'
if (argc < 2 || argc > 2 || *argv[1] == '-' /* assume -h */) 
{
    fprintf(stderr, help, VERSION, argv[0]);
    exit(EXIT_SUCCESS);
}
file = argv[1];
// ~/

// ~='load file into `char * source`'
{
    int file_size;
    size_t fread_size;
    FILE * source_file = fopen(file, "r");
    exit_fail_if ( (source_file == NULL)
                 , "Error: Could not open source file %s\n", file
                 );
    
    /* get file size */
    fseek(source_file, 0L, SEEK_END);
    file_size = ftell(source_file);
    rewind(source_file);
    
    /* copy file into memory */
    source = malloc(1 + file_size); /* one added for null termination */
    fread_size = fread(source, 1, file_size, source_file);
    if (fread_size != file_size) fprintf(stderr, "Warning: fread_size doesn't match file_size???\n");
    fclose(source_file);
    
    source[file_size] = 0;
}
// ~/

// ~='allocate dict memory'
d = dict_new(128); /* for storing chunks */
// ~/

front matter structure

These last chunks are invoked by the main tangle chunk and serve mainly to abstract away some detail at that high level.

// ~='includes'
#include <ctype.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <stddef.h>
#include <stdarg.h>
// ~/

// ~='definitions'
~{includes}

~{globals}

~{data types}

~{functions}

~{help text}
// ~/

tests

The main test case for lili is its own literate source (i.e. this document). Producing lili.c by tangling lili.lili exercises all of the control sequences specified in the help text. lili should produce a source file when run, and the same resulting source file every time it is run. This result should be the same regardless of which version of lili is run. These assertions are tested in the Makefile included in this repository. Simple run make test to ensure lili is working as expected.

Name		Name	Last commit message	Last commit date
Latest commit History 48 Commits
test		test
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
config.mk		config.mk
lili.c		lili.c
lili.lili		lili.lili
todo.md		todo.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

lili: the little literate programming tool

what is literate programming?

should I write literate programs?

how does `lili` work?

program overview

code chunk extraction

code chunk representation

scanning for code

parsing and extracting a chunk

code tangling and output

extra details

list type

dict type

helper functions

setup routine

front matter structure

tests

About

Releases 1

Packages

Languages

License

DocSunset/lili

Folders and files

Latest commit

History

Repository files navigation

lili: the little literate programming tool

what is literate programming?

should I write literate programs?

how does lili work?

program overview

code chunk extraction

code chunk representation

scanning for code

parsing and extracting a chunk

code tangling and output

extra details

list type

dict type

helper functions

setup routine

front matter structure

tests

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

how does `lili` work?

Packages