This diary has no practical value and is not a tutorial, but it was very interesting for me to reread, and I hope you will enjoy the story too :)
I wrote a C compiler called 8cc in 40 days. This is the diary I kept at the time. The code and its full history are available on GitHub.
I am writing the compiler. After about 1000 lines of code it began to work. Here are a few examples that already compile:
int a = 1; a + 2;             // => 3
int a = 61; int *b = &a; *b;  // => 61
Arrays decay to pointers correctly, so the code below works too. Function calls are also supported.
char *c = "ab" + 1;
printf("%c", *c);  // => b
It was not hard to implement, since this is my second attempt. I have gotten better at handling arrays and pointers.
I have made a lot of progress on the compiler, and it works surprisingly well. Nontrivial programs, such as this solver for the eight queens problem, compile and run.
Of course it still lacks many features; the example programs are chosen to avoid them.
The implementation is quite simple; there is not even register allocation.
It compiles programs as for a stack machine, using the hardware stack as the stack. Every operation requires memory access, but for now that is good enough for me.
In the beginning the compiler fit in about 20 lines, and the only thing it could do was read a single integer from standard input and emit a program that immediately exits, returning that integer.
Now it is about 2000 lines. Judging by the git history, it developed like this:
- "+" and "-" added
- parsing and code generation phases separated
- "*" and "/" added
- variables added (implicitly typed as int)
- function calls added
- strings added
- tokenizer and parser separated
- declarations of basic types supported
- pointers and arrays added
- array initializer expressions supported
- "if" added
- function declarations supported
- "for" and "return" added
- pointer assignment supported
- "==" added
- array indexing and pointer arithmetic added
- "++", "--" and "!" added
I successfully implemented structs. A struct is an object that can occupy more than one machine word. Structs are harder to implement than primitive types, but it was easier than I expected.
It seems to work as it should: I can define a struct containing a struct, define a pointer to a struct and dereference it. Structs containing arrays and arrays of structs work too. Even though I knew the code should work in theory, I was still delighted when it actually did, even in such a tricky case.
However, I do not feel that I fully understand why this code is correct. It feels a bit like magic because of its recursive nature.
I cannot yet pass structs to functions. In the x86 calling convention a struct is copied onto the stack and a pointer to it is passed to the function, but on x86-64 you have to split the struct into pieces and pass them through registers. That is hard, so I will postpone it for now. Passing structs by value is needed far less often than passing pointers to them.
Unions were easier to implement, since a union is just a variant of a struct in which all fields have the same offset. The "->" operator is also implemented. Easy as pie.
Supporting floating-point numbers was hard. Implicit conversions between int and float seem to work, but floating-point values cannot yet be passed to functions. In my compiler, all function arguments are first placed on the stack and then moved into registers as the x86-64 calling convention requires, and somewhere in that process there is apparently a bug: it crashes with a memory access error (SIGSEGV). It is hard to debug by inspecting the assembly output, because my compiler makes no effort to produce readable assembly. I thought I could finish this in a day; I was wrong.
I wasted time because I forgot that the x86-64 calling convention requires the stack frame to be 16-byte aligned. I discovered that printf() crashes with SIGSEGV when you pass it several floating-point numbers. I tried to find the conditions under which this reproduces, and it turned out that the position of the stack frame matters, which reminded me of the x86-64 ABI requirements.
I had not taken care of this at all, so the stack frame was only 8-byte aligned, but printf() did not complain as long as it received only integers. The problem is easily fixed by adjusting the stack frame before the CALL instruction. Such problems are unavoidable if you do not read the specification carefully before writing code.
I changed the indentation in the compiler's code from 2 spaces to 4. I am more used to 2-space indentation, because that is what we use at my job at Google, but for some reason 4-space indentation feels more appropriate for "a beautiful open-source program".
There is another, more significant change: I rewrote the tests from shell scripts to C. Before this change, each test function compiled by my compiler was linked against a main() compiled by GCC and then run from a shell script. That was slow, since many processes were spawned for each test. I had no choice when I started the project, because my compiler lacked too many features; for example, it could not even compare a result with the expected value, because there were no comparison operators. Now it is powerful enough to compile the test code itself, so I rewrote the tests to make them faster.
I also implemented the larger types, long and double. Writing this code was fun, because I got these features working very quickly.
I nearly finished implementing a C preprocessor in one day. Actually, it is a port from my previous attempt at writing a compiler.
Implementing a C preprocessor is not a simple task.
It is part of standard C, so it is defined in the specification, but the specification says too little to be useful for an independent implementation. It includes a few macros together with their expanded forms, but it says very little about the algorithm. I think it does not even spell out the details of the expected behavior. In short, it is underspecified.
As far as I know, the PDF linked in this blog post is the only and the best resource on how to implement a C preprocessor. The algorithm described in that document (Dave Prosser's algorithm) tries to expand as many tokens as possible while avoiding infinite macro expansion. Each token carries its expansion history, so tokens are never expanded by the same macro more than once.
The C preprocessor is essentially a language of its own. It has many features, and only experienced C programmers understand it well.
I tried to make the compiler read the system headers, so it now understands #include. While doing so I got many errors. It turned out that my preprocessor still lacks many features, for example the operators that can only be used inside #if. There are plenty of bugs as well. I fixed them as soon as I found them.
System header files are big and confusing. They demand many features from the compiler, such as enum and typedef. I implemented them one after another, though sometimes I cut corners. I am trying to get stdio.h to parse; I do not know how long that will take.
The compiler is now 4000 lines. LCC, a small C compiler, is about 12000 lines. Using it as a guide, I think my compiler will soon be able to work as a real C compiler.
I am surprised that I wrote 500 lines of code today. I can work in a state of flow for 12 hours, but that may be inefficient, since I get tired without noticing it. In any case, I have to admit I am a person with a lot of free time.
I do not remember what exactly I fixed, but stdio.h can now be included. That is really cool, because the types of the functions declared in the header are now handled correctly.
The scheme I am using is to build a compiler for a small subset of the language and then grow it toward full C. Until recently I did not try very hard to implement features I did not fully understand. I could write as much code as I wanted and leave the rest for later. That was fun.
External pieces, such as the system headers, have caused many problems. Now I have to implement everything that is expected of a "real" C compiler. I made many dirty hacks to get stdio.h to parse; for example, I added a hack that simply ignores every occurrence of the token "const". That upsets me.
You may ask why I did not do it the right way from the very beginning. I would answer that it is not fun. For example, the syntax of type declarations in C is far too convoluted for no good reason, and implementing it is not interesting at all.
Still, there are some things I cannot avoid. I probably need to change my attitude and implement every feature, from start to finish. I may even find that interesting, as it brings me closer to the goal. Sometimes I have to write more code than I would like in order to get there.
I have been stuck for two days on the syntax of declarations and definitions, without any success. Why can't I finish it? I got pointers and structs done in a single day of work. I feel I underestimated this. Maybe I need a plan?
In a difficult situation like this, I should probably remember that a month ago the compiler was a single file, to see how far I have come. It just read an integer with scanf() and printed it with printf(). Really, I have made serious progress in one month. Yes, I think I can do this.
I am done writing the parser for declarations and definitions. I think the reason I kept failing is that I tried to write too many parts at once from the very beginning, so this time I wrote pseudocode first to make sure I understood everything correctly, and then converted it into real code.
I have been writing C for nearly 15 years, but only today do I feel that I finally understand C's type syntax. No wonder I could not write working code: I simply did not understand it correctly.
The code I just wrote is so complicated and fragile that even I can barely understand it. I do not believe that Dennis Ritchie, the creator of C, understood the consequences of what he was doing. I suspect he came up with the syntax, wrote code that turned out more complicated than he expected, and the result was then standardized by the ANSI committee. Implementing a standardized language is hard because you have to get everything right. Writing your own toy language is much easier.
Implemented many more operators and cleaned up the code.
Today, for the first time, my compiler managed to compile one of its own source files. When I linked the result with the other files compiled by GCC, it worked, and the resulting compiler seems to work too. The goal appears to be getting closer.
I implemented switch-case, continue, break and goto today. When I was writing test cases for goto, the test code quickly turned into spaghetti code I could not read. It made me laugh. I became convinced of why goto is considered harmful.
Implemented the varargs machinery, namely va_start, va_arg and va_end. These are not used often, but I need them to compile functions like printf.
The C varargs design is not thought through very well. If all function arguments are passed on the stack, va_start can be implemented quite easily, but on modern processors and in modern calling conventions arguments are passed in registers to reduce function call overhead. So the design does not match reality.
Roughly speaking, the x86-64 ABI standardized by AMD requires variadic functions to spill all argument registers onto the stack in preparation for a later va_start. I understand they had no other choice, but it still looks clumsy.
I got curious how other compilers handle variadic functions. I looked at TCC's headers, and they appear not to be compatible with the x86-64 ABI. If the data structure behind varargs differs, functions that take a va_list (such as vprintf) become incompatible. Or am I mistaken? [I was indeed mistaken: they are compatible.] I also looked at Clang, but it looked convoluted and I did not read it. If I read too much of other compilers' code, it might spoil the fun of my own implementation.
After fixing minor problems and adding escape sequences in string literals (there was still no '\0' and the like), I managed to compile one more file. I feel steady progress.
I tried to implement support for functions with more than six parameters, but could not finish it in one day. On x86-64 the first six integer arguments are passed in registers and the rest on the stack. Only register passing is supported right now. Stack passing is not hard to implement, but it takes too long to debug. I do not think my compiler has any functions with more than six parameters, so I will postpone this for now.
Three more files compile today, for a total of 6 out of 11. But counting lines of code, that is only about 10% of the total; the remaining files are much larger, since they contain the core of the compiler.
Worse, in the core files I use fairly new C features such as compound literals and designated initializers. They make self-compilation much harder. I should not have used them, but rewriting the code in plain old C would be unproductive, so I want to support them in the compiler, even though that will take time.
A few notes on debugging facilities. Since a compiler is a complicated piece of code consisting of many stages, you need some way to inspect it for debugging. My compiler is no exception; I implemented several facilities that I found useful.
First, the lexer remembers the current read position, and when it aborts for an unexpected reason, it reports that position. This makes it easy to find a bug when the compiler rejects valid input.
There is a command-line option that prints the internal abstract syntax tree. If there is an error in the parser, I want to look at the syntax tree.
The code generator makes heavy use of recursion, emitting fragments of assembly as it walks the abstract syntax tree, so I was able to attach a stack trace to every line of the assembly output. If I notice something wrong, I can trace the code generator by looking at its output.
Most internal data structures have functions that convert them to strings, which is handy for printf debugging.
I always write unit tests when I implement a new feature. Even mid-implementation, I try to keep the code compiling so that the tests can run. The tests are written to execute quickly, so you can run them as often as you like.
Implemented compound literals and rewrote the initializers for structs and arrays. I did not like the previous implementation; the initializer code is better now. I would have liked to write beautiful code from the start, but since I only understood the problem after writing working code, the rewrite was inevitable.
I think the only feature still missing for self-compilation is struct assignment. I hope everything will work as intended, without much debugging, once it is implemented.
The file containing the tokenizer compiles, but the resulting second-generation compiler for some reason does not generate correct assembly, even though the code generated by the first-generation compiler passes all tests. A tricky bug.
I think I have no choice but printf debugging, because the second generation is compiled by my own compiler, which does not support debug information. I added printf calls in suspicious places. To my slight surprise, the debug messages were printed while the second generation was being compiled: I wanted them only when the second-generation compiler runs, and did not expect output while it was merely being built.
It reminds me of the movie "Inception": we have to go deeper to reproduce this bug. That is the amusing part of debugging a self-compiling compiler.
I fixed the problem that appeared in the second generation when the lexer was self-compiled. It was a bug where -1 > 0 sometimes evaluated to true (I forgot about sign extension). There is one more bug, in struct layout. Only three files remain.
The code generator can now compile itself too. Two files remain. The work is almost done, though perhaps I should not be too optimistic: there may still be unexpected pitfalls.
I fixed many problems caused by the poor quality of the code I wrote at an early stage of this project. It wore me out.
I believed I had everything needed for self-compilation, but that was not true. There were not even prefix increment/decrement operators. For some C99 features I rewrote parts of the compiler to make them easier to compile. Since I did not expect to get close to self-compilation so quickly, I had used as many new C features as I wanted.
Hurray! My compiler can now fully compile itself!
It took about 40 days. That is a very short time to write a self-compiling C compiler, don't you think? I believe my approach of first building a small compiler for a very limited subset of C, and then growing it into a real C compiler, worked very well. In any case, today I am very happy.
Looking at the code, even knowing that I wrote it, it feels a bit like magic that it can take itself as input and translate itself into assembly.
There are still many bugs and unimplemented features. I will probably finish those off, and then start working on improving the output code.
The source code is here. I do not know whether it is worth reading, but it may be interesting to look at as an example of a simple compiler in 5000 lines of C.
With the big milestone reached, I returned to systematic development. I revised the code, trying to read it as if I were a third party, and as a result I am satisfied with its quality. It reminds me of pruning a bonsai tree.
I added a self-compilation test that reruns the build to make sure the second- and third-generation binaries are identical. If I remember correctly, GCC has a similar test.
There is a kind of information that can be passed to the next generation only through the binary executables, even though it is not written down in the compiler's source code.
For example, my compiler translates "\n" (the two-character sequence of backslash and the letter "n") in a string literal into '\n' (here, the newline character). If you think about it, this is a bit odd, because nowhere does the compiler hold the actual ASCII code of '\n'. The character's code is not in the source; it is inherited from the compiler that compiled the compiler. The newline characters of my compiler can be traced back to GCC.
Compilers can pass down much more information than a character code.
Ken Thompson told this strange story in his Turing Award lecture. Ken added code to an early Unix version of the compiler that made the login program accept a special password, so that he could log in to any Unix account. He also made the compiler recognize when it was compiling itself and propagate the login hack into the child compiler, so the backdoor (which appeared nowhere in the source code) was passed from parent to child. You could not remove it even if you carefully examined every line of the compiler's source and recompiled it, because the compiler doing the compiling was already infected. A strange story, isn't it?
Rewrote the expression parser to use recursive descent instead of operator precedence (an operator-precedence parser). I think the reasons to choose an operator-precedence parser are simplicity and speed, but the grammar of C expressions is too messy to handle that way (consider array subscripts or the various kinds of unary operators). The code is now easier to read than before, because the big functions are broken into many small ones. It should also be easier now to add error checking to the parser.
The craft of writing parsers is one of my most useful skills as a programmer. It has helped me countless times.
However, when I read the C language specification to write a recursive-descent parser for the grammar, I found that some productions are left-recursive. I had to think for a while and open a textbook again to remember how to rewrite the grammar to make it right-recursive. Eliminating left recursion is a basic topic in parsing, covered in any introductory textbook, but I could not remember even such elementary things, because I had not used the technique for a long time.
Input is now transformed as follows: character stream → token stream → token stream after macro expansion → abstract syntax tree → x86-64 assembly. The last step probably does too much and is too convoluted: it performs several kinds of work, such as implicit type conversions and label name resolution, while emitting assembly. In theory, I should probably define an intermediate language between the AST and the assembly.
I read the Dragon Book on this topic again, without feeling that I fully understood it. The book is too abstract for me to apply immediately to my case.
My compiler is twice as slow at compiling itself as GCC is. That is not as bad as I expected. My compiler generates terribly naive assembly, but that code is not all that much slower.
I was curious how gcov would work on my code. It found many code branches not covered by the unit tests, and by adding tests for those branches I found several bugs. Code coverage tools are genuinely useful.
I feel I am running out of ideas for what to do with the compiler. Getting to the current state was not hard, because I did not have to learn anything new, but things beyond this point seem demanding.
I moved implicit type conversions out of the code generator; they are now represented explicitly in the AST. This makes it easier to understand what happens under the hood. I also made scattered improvements in various places. I thought the work was almost done, but in fact there is still a pile of unimplemented features and bugs.
I understand compiler optimization better after reading a few more pages of the Dragon Book. I could start writing code once I understand it a little better.
I hunted a mysterious bug for three days: if you round the stack pointer up or down to a 16-byte boundary, the second-generation compiler reports an error for valid input. Bugs like this are harder to debug than ordinary fatal errors.
My first guess was misalignment of the stack pointer. But that was not it: the stack pointer was already correctly aligned to a 16-byte boundary. I could not find any bug in the assembly output, so I decided to bisect.
I tried compiling each file with my compiler and the rest with GCC, to pinpoint a function that reproduces the problem. But that function seemed to contain no errors, and it was not the function that printed the error message; they were not even in the same file. The theory is that one function produces bad data that causes errors in other functions.
After long, unsystematic debugging, I finally found the cause: the compiler does not zero-initialize struct fields. The C specification requires that, when a struct initializer is incomplete, the compiler fill the uninitialized fields with zeros. I knew this when I wrote the code (my code depends on this behavior), but I forgot to implement it. As a result, some struct fields were initialized with garbage instead of zeros. The garbage varies with the current stack position, so moving the stack pointer changed the program's behavior in random ways.
In the end, three days of debugging fixed a single line. I wish I had some means to make this kind of debugging easier.
Another mysterious bug is fixed. If some files are compiled with GCC and the rest with my compiler, the resulting compiler reports syntax errors for valid input. Problems like this are usually ABI compatibility issues. My first guess was struct layout, or the way arguments are passed to functions, but the assembly output for those looked correct.
I again narrowed the search down to a specific file by bisection. I moved functions one by one from one file to the other, to find the one causing the problem. That function was not small, so I split it and moved pieces to the other file. Eventually I was left with a few lines of code, which I compiled with both GCC and my compiler and compared.
The only difference was this: my compiler tests all bits of the register to decide whether it holds the boolean value true, whereas GCC tests only the low 8 bits. So, for example, if the register holds 512 (= 2^9, i.e. 0x200), my compiler treats it as true while GCC disagrees. GCC was in fact returning some very large number with zeros in its low 8 bits, which was treated as false.
Because of this incompatibility, a loop that used a GCC-compiled function to test its exit condition (while the loop itself was compiled by my compiler) stopped immediately on the first iteration. As a result, some preprocessor macros were never defined, some tokens remained undefined, and the parser reported syntax errors on certain inputs. The cause was far from the place where the error was reported, but I finally caught it.
The x86-64 ABI specification has a small note saying that only the low 8 bits are significant for boolean return values. I had read it, but the first time around I did not understand what it meant, so I did not even remember the note existed. Now it is crystal clear to me. I have mixed feelings: I learned something new, but I could have learned it without spending so much time.
Bit fields are implemented.
With bit fields you can pack several variables into a small area, but the compiler still has to insert padding between them depending on their types. My quick study of GCC's output revealed the following rules (with no guarantee that they are correct):
- The alignment of the first bit field follows the alignment of its type. For example, for "int x:5", x is aligned on a 4-byte boundary (assuming int is 32 bits).
- No field may cross its type's natural boundary. For example, for "int x:10", x must not cross a 4-byte boundary. If the remaining bits are not enough to satisfy this, they are left unused and the next bit field starts at the next boundary.
Accessing a bit field takes more than one machine instruction, because the CPU has no instructions for accessing individual bits of memory. To read, you load the word containing the bit field into a register and apply a bitwise AND mask. To store, you first load the word containing the bit field, clear the field with a bitwise AND mask, OR in the new value, and then store the word back.
Computed goto is implemented. An ordinary goto can target only a single label, but a computed goto takes a pointer and transfers control to that address. It is not in standard C; it exists as a GCC extension. If you have ever written a virtual machine with a very big switch-case, you probably know this feature: it is often used to replace the big switch statement with a jump table and computed gotos.
A computed goto compiles down to a single indirect jump instruction. It is probably easiest to understand in terms of assembly. Since my compiler generates assembly, it was very easy to implement.
Implemented C11 _Generic because I wanted to use it in the compiler. But after implementing it, I noticed that GCC 4.6.1 does not support _Generic. So I cannot use the feature: if I did, GCC could no longer compile my compiler.
I also implemented typeof(), which is likewise a GCC extension. Neither feature is used at the moment, but that is fine, since they took very little code to implement.
Added digraphs from C99. A digraph is an obscure feature for exotic environments where certain characters cannot be typed. For example, "<:" is defined as an alias for "[". Digraphs can be converted to the ordinary characters during tokenization, so the feature is useless but harmless.
C89 has trigraphs, which are plainly harmful. Trigraphs are three-character sequences that are replaced by single characters wherever they appear in the source code. For example, printf("huh??!") prints not "huh??!" but "huh|", because "??!" is the trigraph for "|". This is thoroughly confusing. I have no plans to support trigraphs.
I am trying to compile TCC, but so far I am far from success because of unimplemented features and bugs.
TCC is a small C compiler of about 20K to 30K lines of code. If you removed support for every architecture except x86-64, the line count would probably be around 10K-20K. It is astonishing that such a small compiler supports so many features. Fabrice Bellard, the creator of TCC, is a genius.
I have tried several times to read TCC's source code, but I still do not grasp the whole picture. Compilers are complicated programs, so you usually want to break them into small, manageable parts, but TCC's source feels like a monolithic compiler. I could not write such magnificent code. I do not know whether I would want to imitate it, but the fact that there is someone in the world who can write code like that genuinely impresses me.
Kept working, unsuccessfully, on compiling TCC. Of course it is hard: succeeding would mean my compiler has the same features as TCC. Compiling other compilers is useful for debugging, but by its nature it nitpicks every corner of the language specification.
This article is a translation of the original post at habrahabr.ru/post/274083/