[TUHS] c's comment

G. Branden Robinson g.branden.robinson at gmail.com
Fri Dec 7 18:31:56 AEST 2018


At 2018-12-07T04:27:41+0000, Caipenghui wrote:
> Why can't C language comments be nested? What is the historical reason
> for the language design? Feeling awkward.

I'm a callow youth compared to many on this list but I'll give it a try.

My understanding is that it's not so much a historical reason[1] as a
design choice motivated by ease of lexical analysis.

As you may be aware, interpretation and compilation of programming
languages are often split into two parts: lexical analysis and
parsing.

For instance, in

int a = 1;

A lexical analyzer breaks this input into several "tokens":
* A type specifier ('int');
* A variable name (an identifier);
* An assignment operator;
* An integer constant;
* A statement terminator (the semicolon).
(Whitespace, including the newline at the end of the line, is simply
discarded by a C lexer rather than passed along as a token.)
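
To make "token" concrete, here is a minimal sketch, in C, of what a
lexer's output for that line might look like.  The names are invented
for illustration; no particular compiler works exactly this way.

enum token_kind {
    TOK_TYPE,      /* a type specifier keyword such as "int" */
    TOK_IDENT,     /* an identifier such as "a"              */
    TOK_ASSIGN,    /* the assignment operator "="            */
    TOK_NUMBER,    /* an integer constant such as "1"        */
    TOK_SEMI       /* the statement terminator ";"           */
};

/*
 * "int a = 1;" becomes the stream
 *
 *     TOK_TYPE  TOK_IDENT  TOK_ASSIGN  TOK_NUMBER  TOK_SEMI
 *
 * with the whitespace between tokens silently consumed.
 */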

Because we'll need it in a moment, here's another example:

char s[] = "foobar"; /* initial value */

The tokens here are:
* A type specifier;
* A variable name;
* An array declarator, which is really part of how the variable's
  type is spelled (nobody said C was easy);
* An assignment operator;
* A string literal;
* A statement terminator;
* A comment, which the lexer recognizes and then discards, just as
  it discards whitespace.

The lexical analyzer ("lexer") categorizes the tokens and hands them to
a parser, which gives them meaning and builds a "machine" that will
execute the code.  (That "machine" is then translated into instructions
that will run on a general-purpose computer, either in silicon or in
software, as we see in virtual machines like Java's.)
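
To make the division of labor concrete, here is a minimal,
hypothetical hand-written lexer in C.  It reuses the invented token
kinds from the sketch above, knows only the handful of tokens in the
first example, and is nothing like a production C lexer; a parser
would call next_token() in a loop and never touch raw characters
itself.

#include <ctype.h>
#include <string.h>

enum token_kind {
    TOK_TYPE, TOK_IDENT, TOK_ASSIGN, TOK_NUMBER, TOK_SEMI,
    TOK_EOF, TOK_ERROR
};

/* Return the kind of the next token at *p, advancing *p past it.
 * A real lexer would also hand back the token's text (its "lexeme"). */
enum token_kind next_token(const char **p)
{
    while (isspace((unsigned char)**p))  /* whitespace, newlines included, */
        (*p)++;                          /* is skipped, never tokenized    */
    if (**p == '\0')
        return TOK_EOF;
    if (**p == '=') { (*p)++; return TOK_ASSIGN; }
    if (**p == ';') { (*p)++; return TOK_SEMI; }
    if (isdigit((unsigned char)**p)) {
        while (isdigit((unsigned char)**p))
            (*p)++;
        return TOK_NUMBER;
    }
    if (isalpha((unsigned char)**p) || **p == '_') {
        const char *start = *p;
        while (isalnum((unsigned char)**p) || **p == '_')
            (*p)++;
        /* keyword lookup; "int" is the only keyword this toy knows */
        if (*p - start == 3 && strncmp(start, "int", 3) == 0)
            return TOK_TYPE;
        return TOK_IDENT;
    }
    (*p)++;                              /* a real lexer would diagnose this */
    return TOK_ERROR;
}

Called on "int a = 1;" until it returns TOK_EOF, it yields TOK_TYPE,
TOK_IDENT, TOK_ASSIGN, TOK_NUMBER, and TOK_SEMI, in that order.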

There is such a thing as "lexerless parsing", which combines these two
steps into one, but the distinction between tokenization and parsing
remains a useful way to learn how programs actually become things that
execute.

To answer your question: it is often desirable to have a programming
language that is easy to tokenize, because that keeps matters simple
and comprehensible.  Regular expressions are a popular means of doing
tokenization, and the classic Unix tool "lex" has convenient support
for them.  (Its classic counterpart, a parser generator, is known as
"yacc".)

If you have experience with regular expressions you may realize that
there are things that are hard (or impossible[2]) for them to do.

In classic C, there is only one kind of comment.  It starts with a
'/*' that is not inside a string literal, and continues until the next
'*/' is seen.

It is a simple rule.  If you wanted to nest comments, the lexer would
have to keep track of state--specifically, how many '/*'s it had
seen--and pop one, and only one, of them for each '*/' it encountered.

Furthermore, you would have another design decision to make: should a
'*/' close the comment if it occurs inside a string literal that is
itself inside the comment?  People comment out code containing string
literals, after all, and then you have to worry about what those
literals might contain.  (Classic C ducks the question entirely:
nothing inside a comment is interpreted, so the first '*/' ends it no
matter what surrounds it.)
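
To make the trade-off concrete, here is a sketch of both rules in C;
the function names are invented, and this is an illustration rather
than any compiler's actual code.

/* Classic C rule: the comment ends at the very next closing
 * delimiter, no matter what comes before it--string literals inside
 * a comment are not interpreted.  'p' points just past the opening
 * delimiter.  Returns a pointer just past the close, or NULL for an
 * unterminated comment. */
const char *skip_comment(const char *p)
{
    while (*p != '\0') {
        if (p[0] == '*' && p[1] == '/')
            return p + 2;            /* constant state: no counting */
        p++;
    }
    return NULL;
}

/* Hypothetical nesting rule: now the lexer must count openings. */
const char *skip_nested_comment(const char *p)
{
    int depth = 1;                   /* the opening already seen */
    while (*p != '\0') {
        if (p[0] == '/' && p[1] == '*') {
            depth++;
            p += 2;
        } else if (p[0] == '*' && p[1] == '/') {
            if (--depth == 0)
                return p + 2;
            p += 2;
        } else {
            p++;
        }
    }
    return NULL;                     /* never got back to depth 0 */
}

The depth counter in the second function is exactly the unbounded
bookkeeping that a finite automaton--and hence a classic regular
expression--cannot do.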

Not only is it easier on a programmer's suffering brain to keep a
programming language lexically simple--see the recent thread on the
nightmare that is Fortran lexing, for instance--but it also makes it
easier for tools that are not full language implementations to
lexically analyze your code.

A tremendously successful example of this is "syntax" highlighting in
text editors and IDE editor windows, which marks up your code with
pretty colors to help you understand what you are doing.

At this point you may see, incidentally, why it is more correct to call
"syntax highlighting" lexical highlighting instead.

A well-trained lexical analyzer can correctly tokenize and highlight
both of these lines:

int a = 42;
int a = "foobar";

But a parser that performs semantic analysis knows that a string
literal cannot be assigned to a variable of integral type--that is a
type error, which no lexer can see.  It might be nice if our text
editors caught this kind of mistake too, and for all I know Eclipse or
NetBeans does.  But doing so adds significantly more machinery to the
development environment.  In my experience, lexical highlighting alone
forecloses major categories of fumble-fingered or rookie mistakes that
used to linger until a compilation was attempted.
Back in the old days (1993!) a freshman programmer on SunOS 4 would be
subjected to a truly incomprehensible chain of compiler errors arising
from a single small mistake like a missing semicolon.  With the arms
race of helpful compiler diagnostics currently going on between LLVM
and GCC, and with our newfangled text editors doing lexical analysis
and spraying terminal windows with avant-garde SGR escapes to make
things colorful, the learning process for C seems less savage than it
used to be.

If you'd like to learn more about lexing and parsing from a practical
perspective, with the fun of implementing your own C-like language
step-by-step which you can then customize to your heart's content, I
recommend chapter 8 of:

	_The Unix Programming Environment_, by Kernighan and Pike,
	Prentice Hall 1984.

I have to qualify that recommendation a bit: you will have to do some
work to port the traditional K&R C to ANSI C, and these days people
use flex and bison (or flex and byacc) instead of lex and yacc.  But
if you're a moderately seasoned C programmer who hasn't checked off
the "written a compiler" box, Kernighan and Pike hold your hand pretty
well through the process.  It's how I got my feet wet; it taught me a
lot, and it was less intimidating than Aho, Sethi, and Ullman.

I will venture that programming languages that are simple to lex and
parse tend to be easier to learn and retain, and to promote more
uniformity in presentation.  In spite of the feats on display in the
IOCCC, and the interminable wars over brace style and whitespace, we
see less variation in source code layout in lexically simple languages
than we (historically) did in Fortran.  As much as I would love to
offer another example of a hard-to-lex language, I don't know of one.
As others pointed out here, Backus knew the revolution when he saw it,
and swiftly chose the winning side.

I welcome correction on any of the above points by the sages on this
list.

Regards,
Branden

[1] A historical choice would be the BCPL comment style of '//',
reintroduced in C++ and eventually admitted into C with the C99
standard.  An ahistorical choice would have been to use '@@' for this
purpose, for instance.

[2] The identity between the CS regular languages and what can be
recognized by "regular expression" implementations was broken long ago,
and I am loath to make claims about what something like perlre can't do.