Developers Club geek daily blog

1 year, 8 months ago
Periodically facing small problems on development of unpretentious analyzers of the text, solved this problem to automate and the academic interest haunted. Originally looked towards Racc (one of interpretations of Yacc), but it seemed me rather not a simple solution for my small tasks and then I decided to develop the unpretentious analyzer. Though, of course, if you develop the compiler or something similar and still commercially, then definitely you should look towards Racc.

But if you want to understand what is the parser and as fast to write it most, at the same time without reading a lot of articles about lexical analyzers like book of a dragon, then forward under kat (though the book very good).

Analysis of an input flow


If to consider a task of the analysis of an input flow very abstractly, and I read out from it it is necessary to begin, then everything is reduced to implementation of the finite state machine, i.e. machine of statuses where a condition of transition to a certain status is existence in an input flow of the necessary data (further templates). In turn transition to a status can be executed only at unambiguous and complete coincidence of characters of an input flow to a specific template why unambiguous since from one status transition to several statuses and still input characters of a flow is possible at once match several templates of statuses at the same time, the analyzer is in an uncertainty status. Proceeding their aforesaid, the analyzer can is in the following statuses:

  • determinancy status.
  • status not of determinancy.
  • change of a status.

Of course, also wrong statuses, but since we consider the generalized model of the analyzer are possible and we do not know on what sets characters it is necessary to give an error message, this task lays down on shoulders of specific implementation.

In a determinancy status we precisely know what processor of a status to us it is necessary to cause, but in an uncertainty status we do not know what there will be the following status i.e. with what template there will be a coincidence or eventually unambiguous coincidence will not be, then in this case we should buffer an input flow and then to transfer characters from the saved-up buffer to already certain status, i.e. at uncertainty we buffer an input flow and already in the mode of change of a status we have to give these buffers to a new status or return flowing if transition did not happen.

Now, it is possible to make decomposition, more precisely detailing of process of the analysis of an input flow on:
classification of an input flow (matches one, many and does not match one template) and processing of result of classification (transition, buffering, etc.).

Implementation


Though on the main kind of activity I am developers of the built-in systems and the main my language of programming is ANSI C, I decided to implement this model on Ruby since on it you think of technical details of implementation less and was going to implement this model as Ruby Gem. I very much love all universal (in a certain framework), short and laconic. Further I am going to implement this model in emdedded for the analyzer of an input traffic, but in more simplified type and it is possible on HDL.

So, we will begin, once again I will repeat, I know Ruby not really deeply therefore staked on the saved-up experience and accurate algorithmization of process at himself in the head therefore considered that my knowledge quite for this purpose will be enough Ruby as my teacher of music once spoke: "learn to improvise and do music on three-five notes, and then you will already thresh abrupt passages, and it is not possible to listen to that your continuous arpedzhio". Sometimes remembering it, now I consider that it was right though then there was a wish for abrupt solyak, all right, we will return to a subject.

I consider development process has to go step by step at what stages have to be "tangible", i.e. you have to understand what needs to be made for implementation of this stage and in what volumes (labor input), deeply about it I will not be now, not this subject of our conversation, but for implementation of the parser, its building and testing has to go on gradual building of a code with each status. And here I could not do without interactive mode if someone uses my gem, I think to it too originally it is necessary to debug behavior of its model in an interactive mode but only then to write processors of statuses of the parser.

Example


For bigger clearness I will give example of implementation of the simple parser of the file of the source code for automatic generation of documentation which is bearing a faint resemblance to doxygen. Originally it is necessary to set my gem:

$ gem install iparser

As you already understood gem iparser is called and after its installation we will create the file with the name 'parser_example.rb' in which we will write a code of our parser based on iparser. First of all we will connect library and we will create object the parser machine which implements the main logic of the analyzer, i.e. a copy of the parser:

require 'iparser'

parser = Iparser::Machine.new

The following step, it is already possible to describe parser statuses. We will have absolutely simple parser and it will have only three statuses:

  • initial (idle).
  • one-line comments (comment-line).
  • multiline comments (comment-block).

Now we can describe each status separately (I will include below a script code in general). I will not describe a status of 'idle' in detail since in our case it empty, and we will consider a status of 'comment-line' on closer. For creation of a copy of statuses we use the following code:

ps_idle = Iparser::State.new('idle')
ps_cline = Iparser::State.new('comment-line')

In parameters of the designer of a status the line with a name (identifier) of a status is specified, it is desirable clear to you since it is used for distant debugging of model. Each object of statuses of the parser has an entry field, a template coincidence to which transition to this status, i.e. an input condition to a status is made. Comparison with a template is made character-by-character, also as well as processing of an input flow. The input in a status will be performed in the presence in an input symbol stream' \\\', going in a row. There is also a leave field to specify a template for an output from a status (return to previous), in our case it is '\n\r line end characters' though it is enough to specify the first of them, but for descriptive reasons we will specify both. Now we will describe 'comment-line':

ps_cline.entry << /\//
ps_cline.entry << /\//
ps_cline.entry << /\//
ps_cline.leave << /[\n\r]/

Pay attention, templates are set by means of regular expressions, also without them, but it is possible then.

Further we will create a status for processing of multiline comments, we will get to it at a meeting of characters '/**', and to leave it in the presence */'. Too we will write:

ps_cblock = Iparser::State.new('comment-block')
ps_cblock.entry << /\//
ps_cblock.entry << /\*/
ps_cblock.entry << /\*/
ps_cblock.leave << /\*/
ps_cblock.leave << /\//
ps_cblock.ignore << '*'

Also we specified characters which should be ignored, being in this status. At us it is the character an asterisk since I like to write multiline comments of a type:

/**
 * ...
 * ...
 */

Now it is necessary to connect our three statuses in a uniform chain, from 'idle' we can get to 'comment-line' and 'comment-block', and from them only back to 'idle'. Binding is performed by means of the indication of indexes of statuses in the field of branches of each of statuses. The index will be defined by order of adding of objects of statuses in a parser copy, for adding of objects in the parser the addstate parser method is used.

ps_idle.branches << 1
ps_idle.branches << 2

parser.addstate ps_idle
parser.addstate ps_cline
parser.addstate ps_cblock

And at last it is necessary to check whether correctly we created a chain of statuses, for this purpose we will start an interactive mode (the prestart method — will set all initial parameters):

parser.prestart
parser.interactive_parser

For bigger clearness I will give a script code in general, of course I rearranged it a little, but it contains only that code I wrote above:

require 'iparser'

#
# Create parser-machine object.
#
parser = Iparser::Machine.new

#
# Create startup state for this parser-machine.
#
ps_idle = Iparser::State.new('idle')

#
# Add branch indexes to 'comment-line' and 'comment-block' state.
#
ps_idle.branches << 1
ps_idle.branches << 2

#
# Create single line comment state for this parser-machine.
#
ps_cline = Iparser::State.new('comment-line')
ps_cline.entry << /\//
ps_cline.entry << /\//
ps_cline.entry << /\//
ps_cline.leave << /[\n\r]/

#
# Create multiline comment state for this parser-machine.
#
ps_cblock = Iparser::State.new('comment-block')
ps_cblock.entry << /\//
ps_cblock.entry << /\*/
ps_cblock.entry << /\*/
ps_cblock.leave << /\*/
ps_cblock.leave << /\//
ps_cblock.ignore << '*'

#
# Add all states to parser-machine.
#
parser.addstate ps_idle
parser.addstate ps_cline
parser.addstate ps_cblock

#
# Call parser startup method.
#
parser.prestart

#
# Call interactive mode for check state-machine.
#
parser.interactive_parser

Not such and a big code, we will start a script now:

$ ruby parser_example.rb

For an output from an interactive mode it is necessary to enter blank line (simply to click enter at once).

We enter '\\\' and we see that that on the last character the parser passes into a status of 'comment-line' (branch to comment-line), we enter '\now n' or '\r' and we see that the parser comes back to a status of 'idle' (back). The character '\' is used for input of esc-characters, respectively to enter the character '\' it is necessary to take him twice '\\', well as well as at the indication of lines in Xi and other languages. In an interactive mode it is possible to play about to the top of the bent, yet be not convinced that the chain of all statuses is connected correctly.

The final stage, considering that we checked and debugged logic of transitions of our automatic machine, it is possible it is possible to add processors for our statuses (and only now since wrote above about the development process course), they will be absolutely simple. Each status can have three processors:

  • init: status initialiser (designer of a status).
  • handler: processor of a status.
  • fini: status finalizer (status destructor).

The designer of a status will be called only once at an input in a status and will receive an array of buferizirovanny characters which were saved up by the parser as argument while was in the uncertainty mode. The processor will be constantly called and as argument will receive the character of an input flow until he leaves a status, and the destructor will be caused only once at an output from this status.

The method processor (handler) can also manage operation of the parser though can do it I and in vain implemented while is, means it is necessary to describe. If the processor returns Fixnum data type value of which is in limits (> =0), then it will be regarded as an index for transition to some status if the index oversteps array boundaries of statuses, the parser will throw out an exception. If the processor returns nil, the parser will hold this status i.e. to what reaction and if any other value, then it is regarded as an error and the parser will return false that signals about a parsing error since in all other cases it returns true. I will give below the reversed code since it seems and so in detail tried to describe everything.

require 'iparser'

#
# Simple check startup arguments.
#
if( ARGV.size != 1 || !File.exist?(ARGV[0]) )
  puts
  puts "ERROR: unable to open file #{ARGV[0]}"
  puts
  exit
end
#
# Create output file.
#
$fout = File.new( 'index.html', 'w' )

#
# Create initializer method for parser-states.
#
def doc_init ( str )
  $fout.print "<p>"
end
#
# Create handler method for parser-states.
#
def doc_handler ( c )
  $fout.print c
end
#
# Create finalizer method for parser-states.
#
def doc_fini ( str )
  $fout.puts "</p>"
end

#
# Create parser-machine object.
#
parser = Iparser::Machine.new

#
# Create startup state for this parser-machine.
#
ps_idle = Iparser::State.new('idle')

#
# Add branch indexes to 'comment-line' and 'comment-block' state.
#
ps_idle.branches << 1
ps_idle.branches << 2

#
# Create single line comment state for this parser-machine.
#
ps_cline = Iparser::State.new('comment-line')
ps_cline.entry << /\//
ps_cline.entry << /\//
ps_cline.entry << /\//
ps_cline.leave << /[\n\r]/

#
# Create multiline comment state for this parser-machine.
#
ps_cblock = Iparser::State.new('comment-block')
ps_cblock.entry << /\//
ps_cblock.entry << /\*/
ps_cblock.entry << /\*/
ps_cblock.leave << /\*/
ps_cblock.leave << /\//
ps_cblock.ignore << '*'

#
# Add handlers for states.
#
ps_cline.init( method(:doc_init) )
ps_cline.handler( method(:doc_handler) )
ps_cline.fini( method(:doc_fini) )

ps_cblock.init( method(:doc_init) )
ps_cblock.handler( method(:doc_handler) )
ps_cblock.fini( method(:doc_fini) )

#
# Add all states to parser-machine.
#
parser.addstate ps_idle
parser.addstate ps_cline
parser.addstate ps_cblock

#
# Call parser startup method.
#
parser.prestart

#
# Call interactive mode for check state-machine.
#
$fout.puts "<html>"
$fout.puts "<body>"

File.open( ARGV[0], 'r' ).each do |line|
  line.each_char do |c|
    parser.parse(c)
  end
end

$fout.puts "</body>"
$fout.puts "</html>"
$fout.close

Yes, pay attention that our processor (doc_handler) does not return nil since we do not analyze result of operation of the parser (parser.parse method). For the last check we will create the test file with the name 'test.c' and we will fill it with the following contents:

#include <stdlib.h>

///Test function - 1.
void test1 ( void )
{
}

/**
 * Test function - 2.
 */
void test2 ( void )
{
}

Let's start our script last time and we will see result of operation of our parser.

$ ruby parser_example.rb test.c

It is the index.html file, on it everything, all thanks for attention!

I hope that to someone it will help to be become and will reduce time for studying of the basic principles of operation of analyzers, and someone can and will want to trade on my gem without having found simpler option if it was not pleasant, strongly do not kick though is not present of course, criticize, not to become the good creator without good criticism as. If to whom it is interesting, here the link to gem iparser.

This article is a translation of the original post at habrahabr.ru/post/271969/
If you have any questions regarding the material covered in the article above, please, contact the original author of the post.
If you have any complaints about this article or you want this article to be deleted, please, drop an email here: sysmagazine.com@gmail.com.

We believe that the knowledge, which is available at the most popular Russian IT blog habrahabr.ru, should be accessed by everyone, even though it is poorly translated.
Shared knowledge makes the world better.
Best wishes.

comments powered by Disqus