Basics of compiler construction. Lexical analysis in C#

The task of lexical analysis is to split the input character sequence (in my case, Pascal source code) into words, or lexemes.

To begin with, I created five typed lists for storing data: identifiers, constants, keywords, separators, and the convolution (the encoded sequence of tokens). An array of delimiters is also required:

static char[] limiters = {',', '.', '(', ')', '[', ']', ':', ';', '+', '-', '*', '/', '<', '>', '@'};

An array of keywords is needed as well. I limited myself to eleven entries ("begin" happens to appear twice, which is harmless), since this article is meant as an introductory example of lexical analysis of Pascal in C#.
So, the array of keywords:

 static string[] reservedWords = { "program", "var", "real", "integer", "begin", "for", "downto", "do", "begin", "end", "writeln" };
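The five lists and the shared working variables are not shown in the article. As a minimal sketch, they might be declared as below. The names tableI, tableC, tableR, tableL, LConv, temp, type, i and AllTextProgram are taken from the code that follows; the class name LexerState and all of the types are assumptions.

```csharp
using System.Collections.Generic;

// Hypothetical container class; in the article these are presumably
// static fields of the program class itself.
static class LexerState
{
    public static List<string> tableI = new List<string>(); // identifiers, table 1
    public static List<string> tableC = new List<string>(); // constants,   table 2
    public static List<string> tableR = new List<string>(); // keywords,    table 3
    public static List<string> tableL = new List<string>(); // delimiters,  table 4
    public static List<string> LConv  = new List<string>(); // convolution: one code per token

    public static string AllTextProgram; // the whole source text
    public static string temp;           // token being collected; null between tokens
    public static int type;              // kind of the token currently in temp
    public static int i;                 // current position in AllTextProgram
}
```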

As input we have the path to a .txt file containing the program, stored in the path variable. We read the entire file into AllTextProgram and begin.

    // The using block closes the reader even if reading fails.
    using (StreamReader sr = new StreamReader(path, Encoding.Default))
    {
        AllTextProgram = sr.ReadToEnd();
    }

In the for loop we take each character of AllTextProgram in turn and pass it to the Analysis method. I handle only two special cases separately. The first is apostrophes, which in my case appear with the writeln keyword: when we meet an apostrophe, we set the usual rules aside and walk through AllTextProgram until we find the second apostrophe. The second is comments: I simply count opening and closing curly braces and wait until their difference reaches zero.
It may seem that the program will run off the end of the text if the closing apostrophe is never found or the brace count never returns to zero. That is true (in C# the out-of-range index throws an IndexOutOfRangeException), so these loops need additional exit conditions. To keep things organized, I decided to add those conditions at the syntax analysis stage.

    for (i = 0; i < AllTextProgram.Length; i++)
    {
        // String literal: copy everything up to the closing apostrophe as one constant.
        if (AllTextProgram[i] == '\'')
        {
            temp += '\'';
            i++;
            while (AllTextProgram[i] != '\'')
            {
                temp += AllTextProgram[i];
                i++;
            }
            temp += '\'';
            type = 2;
            Result(temp);
            temp = null;
        }

        // Comment: skip to the matching closing brace, counting nested braces.
        if (AllTextProgram[i] == '{')
        {
            int chet = 1; // number of braces still open
            while (chet != 0)
            {
                i++;
                if (AllTextProgram[i] == '{') chet++;
                if (AllTextProgram[i] == '}') chet--;
            }
        }

        Analysis(AllTextProgram[i]);
    }
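The exit conditions mentioned above can be sketched as two small search helpers that report failure instead of reading past the end of the text. This is only an illustration: the class SafeScan and both method names are hypothetical and are not part of the article's code.

```csharp
using System;

static class SafeScan
{
    // Returns the index of the apostrophe that closes the literal opening at
    // text[start], or -1 if the text ends before the literal is closed.
    public static int FindClosingApostrophe(string text, int start)
    {
        return text.IndexOf('\'', start + 1);
    }

    // Returns the index of the '}' that closes the comment opening at text[start],
    // counting nested braces as the loop above does, or -1 if it never closes.
    public static int FindCommentEnd(string text, int start)
    {
        int depth = 0;
        for (int k = start; k < text.Length; k++)
        {
            if (text[k] == '{') depth++;
            else if (text[k] == '}' && --depth == 0) return k;
        }
        return -1;
    }
}
```

With helpers like these, the main loop could report an unterminated literal or comment when -1 comes back, and otherwise jump straight to the returned index.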

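The type variable records which kind of token is being collected, and therefore which table it will go to. The tables are numbered as follows:
Identifiers - table 1
Constants - table 2
Keywords - table 3
Separators - table 4
Note that type itself only takes the values 1, 2 and 3: a value of 3 marks a separator, which is stored in table 4, while keywords are recognized by lookup in the Result method.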
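Each entry of the convolution is the table number followed by the token's index in that table; this is what the Result method shown later appends to LConv. A minimal sketch of that encoding and of how an entry could be read back, where the class ConvCode and its methods are hypothetical helpers introduced only for illustration:

```csharp
using System;

static class ConvCode
{
    // Builds a convolution entry: the table digit followed by the index.
    public static string Encode(int table, int index)
    {
        return table.ToString() + index;
    }

    // Splits an entry back into table and index. The first character is always
    // the single-digit table number, so the rest of the string is the index.
    public static void Decode(string code, out int table, out int index)
    {
        table = code[0] - '0';
        index = int.Parse(code.Substring(1));
    }
}
```

Because the table number is always one digit, an entry such as "112" is unambiguous: table 1, index 12.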

Now it is time to look at what happens in the Analysis method. This is best done as comments in the code. Note first that the temp variable holds the intermediate result until we meet a separator. When we encounter a separator, we pass temp to the Result method, which classifies the current token as an identifier, a constant, or a keyword.

    static void Analysis(char nextChar)
    {
        int asciiCode = (int)nextChar;

        // Latin letters (A-Z, a-z) and the underscore start or continue an identifier.
        if (((asciiCode >= 65) && (asciiCode <= 90)) || ((asciiCode >= 97) && (asciiCode <= 122)) || (asciiCode == 95))
        {
            if (temp == null) type = 1;
            temp += nextChar;
            return;
        }

        // Digits (0-9) and the dot: part of a numeric constant, or a digit inside an identifier.
        if (((asciiCode >= 48) && (asciiCode <= 57)) || (asciiCode == 46))
        {
            // A dot belongs to a number only if temp so far parses as an integer;
            // otherwise it is treated as a delimiter below.
            if (asciiCode == 46)
            {
                int out_r;
                if (!int.TryParse(temp, out out_r)) goto not_the_number;
            }
            if (temp == null) type = 2;
            temp += nextChar;
            return;
        }

    not_the_number:
        // Whitespace (space, tab, line break) ends the current token.
        if (char.IsWhiteSpace(nextChar) && temp != null)
        {
            Result(temp);
            temp = null;
            return;
        }

        // Delimiters. The two-character forms ":=", "<>", "<=" and ">=" are checked first.
        foreach (char c in limiters)
        {
            if (nextChar == c)
            {
                if (temp != null) Result(temp);
                type = 3;
                if (nextChar == ':' && AllTextProgram[i + 1] == '=')
                {
                    temp = nextChar.ToString() + AllTextProgram[i + 1];
                    i++; // skip the second character so it is not analyzed again
                    Result(temp);
                    temp = null;
                    return;
                }
                if (nextChar == '<' && (AllTextProgram[i + 1] == '>' || AllTextProgram[i + 1] == '='))
                {
                    temp = nextChar.ToString() + AllTextProgram[i + 1];
                    i++; // skip the second character so it is not analyzed again
                    Result(temp);
                    temp = null;
                    return;
                }
                if (nextChar == '>' && AllTextProgram[i + 1] == '=')
                {
                    temp = nextChar.ToString() + AllTextProgram[i + 1];
                    i++; // skip the second character so it is not analyzed again
                    Result(temp);
                    temp = null;
                    return;
                }
                temp = nextChar.ToString();
                Result(temp);
                temp = null;
                return;
            }
        }
    }
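The two-character checks read AllTextProgram[i + 1], which assumes that another character always follows. A minimal sketch of the same lookahead with a bounds check, written as a standalone helper; the class DelimiterScan and the method ReadDelimiter are hypothetical names, not part of the article's code:

```csharp
using System;

static class DelimiterScan
{
    // Returns the delimiter that starts at text[pos]: one of the two-character
    // forms ":=", "<>", "<=", ">=" if present, otherwise the single character.
    public static string ReadDelimiter(string text, int pos)
    {
        char first = text[pos];
        if (pos + 1 < text.Length)
        {
            char second = text[pos + 1];
            bool pair = (first == ':' && second == '=')
                     || (first == '<' && (second == '>' || second == '='))
                     || (first == '>' && second == '=');
            if (pair) return text.Substring(pos, 2);
        }
        return first.ToString();
    }
}
```

A caller would then advance its position by the length of the returned string, so the second character of a pair is never examined twice.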

The last thing to look at is the Result method, where we determine what we have found. At the very beginning we check whether temp is a keyword. We do this first so that a keyword is not mistaken for an identifier. If the token is not a keyword, a switch on the type value set earlier in the Analysis method selects the table. We must not forget to check that the identifier, constant, or delimiter is not already listed in its table.

    static void Result(string temp)
    {
        // Keywords first, so that a keyword is never recorded as an identifier.
        for (int j = 0; j < reservedWords.Length; j++)
        {
            if (temp == reservedWords[j])
            {
                for (int i = 0; i < tableR.Count; i++)
                {
                    if (temp == tableR[i])
                    {
                        LConv.Add("3" + i);
                        return;
                    }
                }
                tableR.Add(temp);
                LConv.Add("3" + (tableR.Count - 1));
                return;
            }
        }

        // Otherwise type selects the table; a token is added only if it is not there yet.
        switch (type)
        {
            case 1: // identifiers, table 1
                for (int j = 0; j < tableI.Count; j++)
                {
                    if (temp == tableI[j])
                    {
                        LConv.Add("1" + j);
                        return;
                    }
                }
                tableI.Add(temp);
                LConv.Add("1" + (tableI.Count - 1));
                break;
            case 2: // constants, table 2
                for (int j = 0; j < tableC.Count; j++)
                {
                    if (temp == tableC[j])
                    {
                        LConv.Add("2" + j);
                        return;
                    }
                }
                tableC.Add(temp);
                LConv.Add("2" + (tableC.Count - 1));
                break;
            case 3: // delimiters, table 4
                for (int j = 0; j < tableL.Count; j++)
                {
                    if (temp == tableL[j])
                    {
                        LConv.Add("4" + j);
                        return;
                    }
                }
                tableL.Add(temp);
                LConv.Add("4" + (tableL.Count - 1));
                break;
        }
    }
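The same "find it, or add it and take the last index" step is repeated for every table. As a sketch of one way to factor it out, assuming the tables are List<string>, a single helper could serve all four; the class TableHelper and the method IndexOfOrAdd are hypothetical names:

```csharp
using System.Collections.Generic;

static class TableHelper
{
    // Returns the index of token in table, appending it first if it is not there yet.
    public static int IndexOfOrAdd(List<string> table, string token)
    {
        int index = table.IndexOf(token);
        if (index < 0)
        {
            table.Add(token);
            index = table.Count - 1;
        }
        return index;
    }
}
```

Each branch of Result would then reduce to a single line, for example LConv.Add("1" + TableHelper.IndexOfOrAdd(tableI, temp)); for identifiers.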

Finally, a simple example to show how it all works.
The program given as input:

    program main; { pro{g}am }
    var sum: real;
        f, per_1, x_1,i:integer;
    begin
      {sum:=(-x_1+2.5)*4 - (x_1-6)*((((x_1+2))));}
      x_1:=18;
      f:= 456;
      for i:=10 downto per_1-f+i*(x_1+1) do
      begin
        per_1 := per_1 + x_1*sum*f;
        x_1:=-x_1-1;
        f:=(f+1)*(x_1-24700);
        sum:=(x_1+2.5)*4 - (x_1-6)*(x_1+2);
      end;
      writeln( 'summa = ' , sum);
    end.

And the result of the work:

Source: https://habr.com/ru/post/132422/
