How I created the HTML killer

In the life of every self-taught novice programmer there comes a moment when watching the next video in a "Basics of %language_name%" series is no longer interesting, but it is still too early to apply for junior positions. At that point you want to test your strength on some "almost real" project. Under the cut is the story of how I came up with such a project and what came of it.

It so happens that in our country computer science teachers love Pascal, and that is the language students at pedagogical universities are usually taught. I have nothing against it; however, as a student of a pedagogical university, I wanted to learn something more mainstream.

After getting acquainted with several programming languages and writing "Hello world!"-level projects in them, Python caught my eye.
Its syntax and capabilities delighted me. I read every book on Python I could find, worked through hundreds of tutorials, and watched several video courses. Understandably, I was impatient to put my knowledge into practice. So when one course allowed labs to be submitted in any language, I caused the teacher a fair amount of trouble (apparently none of the other students had indulged the teachers with such delights before). After several such lab projects, I wanted to move on and get down to a "real" project.

One day, while mulling over ideas for such a project, I saw the picture at the top of this post, and it dawned on me: I should write my own markup language, with Russian-language commands and brackets!

Idea

The picture gave me the idea, but that particular version of a "markup language made in the USSR" did not seem very good. In short, the requirements for the language can be formulated as follows:

All commands must be typeable in the standard Russian keyboard layout;
Compilation to HTML;
"No!" to closing tags.

The first requirement was dictated by the concept itself, the second by the desire to see the results of the work in practice. As for the third, I have disliked writing duplicate closing tags (like </body> and </bold>) ever since high school, although their necessity is indisputable. Therefore, I decided to introduce a universal "closer" that would close any tag.

Syntax

I should clarify here that the language itself was not the goal; the main task was to level up my Python programming skills. The syntax was therefore given a minimum of time: in essence it is a calque of HTML with a few peculiarities. This was largely dictated by the desire to be able to compile to HTML.

In general, a command of the language can be described as follows:

\<command>::<attributes>::(<content>)

The attribute part is optional. Using :: to delimit attributes may seem an odd decision ([] or {} would be more obvious, for example), but it is a sacrifice made for the sake of the Russian keyboard layout.

Implementation

As mentioned above, I am no professional and had never written a compiler, although I had a general idea of how they work.

This is where my knowledge of automata theory came in handy, acquired in the university course "Theoretical Foundations of Informatics". I remember it seeming useless to me at the time; perhaps that had to do with how the course was taught.

The compiler's algorithm was divided into 2 stages:

Getting a list of tokens from the source text;
Using the resulting list to generate code.

To obtain the list of tokens, the source text was scanned almost character by character; whenever a "boundary" character was encountered, the current token was added to the list. Token boundaries could be: \, (, ), %, commands, the end-of-line character, etc.
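The scanning step described above could look roughly like the following. This is only a minimal sketch, not the project's actual code: the function name tokenize, the BOUNDARIES set, and the Latin command names \par and \bold are assumptions made for illustration (the real language uses Russian commands).

```python
# Assumed boundary set: the post names \, (, ), % and the end of line.
BOUNDARIES = {"\\", "(", ")", "%", "\n"}


def tokenize(source):
    """Split source text into tokens, cutting at boundary characters."""
    tokens = []
    current = ""
    for ch in source:
        if ch in BOUNDARIES:
            if current:
                tokens.append(current)   # flush the token collected so far
            if ch == "\\":
                current = ch             # a backslash starts a command token
            else:
                tokens.append(ch)        # other boundaries are tokens themselves
                current = ""
        else:
            current += ch
    if current:
        tokens.append(current)           # flush whatever is left at the end
    return tokens


# Hypothetical command names, used here only as stand-ins.
print(tokenize("\\par(a \\bold(b))"))
```

In this sketch a command and its optional ::attributes:: part stay together in one token, since : is not treated as a boundary.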

At the second compilation stage, a state machine was used that stored the current state and, possibly, the state stack. Consider an example.

The code should look something like this:

 \(   \(),   - \())

After compilation, this was supposed to turn into:

 <p>   <b></b>,   - <i></i></p>

To implement the machine, the following were created:

TokenList - a list that stores tokens;
STATE - the current state of the machine;
statement - the state stack.

This can be depicted as follows:

Suppose the token \par enters the automaton first. Since this token is a command of the language, it sets a new state (<p), which becomes the current one, while the previous state (<body>) is pushed onto the state stack.

Please note: the new state is just <p, not <p>. This is because state names are used to generate code, and some tag attributes may come before the > character. In fact there are two distinct states, <p and <p>; the automaton moves to the latter when the token "(" arrives at the input. The same scheme was used for all commands, so the first kind of state was called an "incomplete tag" and the second a "full tag".

If the token ")" arrives at the input and the current state is a "full tag", the matching closing tag (for example, </p>) is added to the generated file, and the previous state is popped from the top of the stack and becomes current.
Parsing ends when, after the next ")", the state stack turns out to be empty.
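The transitions described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the project's real implementation: the function name generate, the COMMANDS table, and the Latin command names are hypothetical, and attributes are left out entirely.

```python
# Hypothetical command names standing in for the language's Russian commands.
COMMANDS = {"\\par": "p", "\\bold": "b", "\\italic": "i"}


def generate(tokens):
    """Turn a token list into HTML using a current state plus a state stack."""
    state = "<body>"   # current state
    stack = []         # previously active states
    out = []
    for token in tokens:
        if token in COMMANDS:
            stack.append(state)                  # remember the enclosing state
            state = "<" + COMMANDS[token]        # incomplete tag, e.g. "<p"
            out.append(state)
        elif token == "(" and not state.endswith(">"):
            state += ">"                         # full tag, e.g. "<p>"
            out.append(">")
        elif token == ")" and state.endswith(">") and stack:
            out.append("</" + state[1:])         # "<p>" becomes "</p>"
            state = stack.pop()
            if not stack:
                break                            # empty stack: parsing ends
        else:
            out.append(token)                    # anything else is plain text
    return "".join(out)


tokens = ["\\par", "(", "a ", "\\bold", "(", "b", ")", ")"]
print(generate(tokens))
```

Because the closing tag is derived from the current state, a single ")" can close any tag, which is how the universal "closer" falls out of this design.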

Text between :: and :: is treated as tag attributes. The plan was to parse them with a separate algorithm of their own, so at first this text was simply inserted between the tag name and the > character. On the one hand, this allowed HTML syntax to be used inside NGTR commands; on the other hand, the text was not analyzed at all, so there was no guarantee that it was free of errors from the HTML point of view.
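A pass-through of this kind might be sketched as follows. The helper open_tag, its parameters, and the command name \par are assumptions for illustration only; the attribute text is copied verbatim, exactly because nothing validates it.

```python
def open_tag(command_token, commands):
    """Build an incomplete tag from a command token with optional attributes."""
    # Split off everything after the first '::' marker, if there is one.
    name, sep, rest = command_token.partition("::")
    tag = "<" + commands[name]
    if sep:
        attrs = rest.rsplit("::", 1)[0]   # text between the two '::' markers
        if attrs:
            tag += " " + attrs            # copied as-is, never checked as HTML
    return tag


# Hypothetical command table with a single stand-in command.
commands = {"\\par": "p"}
print(open_tag('\\par::class="note"::', commands))
print(open_tag("\\par", commands))
```

The result is still an "incomplete tag": the > character would only be appended once the "(" token arrives.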

Results

Overall, the results of the work are mixed.

Initially, the goal of the project was to level up my Python programming skills and to create a product I would not be ashamed to show a potential employer (and, of course, to stroke my ego). And while the first part went more or less well (I got to know the language better and dealt with GitHub a little), the second did not go so smoothly.

Some problems that have not been resolved:

Problems implementing an analogue of the <pre> tag;
Attribute analysis is not implemented: in fact, attributes are simply substituted into the HTML tag;
The idea of never switching keyboard layouts has (so far?) proved untenable: attributes and styles still have to be written in English, and no YKTS of any kind exists, not even on the drawing board;
A questionable decision: a compiler written in an interpreted language;
Not all tags are implemented;
It is possible to write a "real website" in NGTR, but you still need to know HTML.

The second and fifth items on this list are solvable in principle, but I must admit that for now I have somewhat cooled toward this idea.

On the other hand, some new ideas have appeared. In particular, I would like to figure out how to properly organize compiler options through command-line arguments, or how to create a text editor with NGTR syntax highlighting. But there I do not even know where to start.

Bottom line: I gained some development experience, but the project cannot be called a success. And yes, the arrival of the "HTML killer" is postponed indefinitely.

The compiler sources are on GitHub. Questions, comments, suggestions, criticism, and help are welcome.

Source: https://habr.com/ru/post/309552/
