I'm not really all that mysterious

scattered thoughts on code complexity and natural language

Steve Yegge’s rants about programming are always really interesting. I’m all about the big picture, and I like how he can abstract his arguments so that they make sense to a non-specialist. Very few technically competent people (whatever the field) are actually able to do this, and if more of them could, it would certainly make cross-discipline interaction a lot easier.

His latest diatribe is essentially about the unmaintainability of massive amounts of code that has low semantic expressivity.

Several commenters seem to miss the point completely and start arguing about Lines of Code™ and how all projects are doomed to massiveness. Clearly, these people have never worked with languages such as Perl, Ruby, or Lisp.

But as other commenters have pointed out, what Yegge is actually talking about is the concept-to-code ratio. To put it another way, how many keywords does it take to spell out a particular concept?

This is intimately related to a notion in sociolinguistics commonly referred to as context, a term with a specifically narrow meaning in that field. From the rudimentary linguistics class I took as an undergrad, the most vivid example I remember is the contrast between Mandarin Chinese and English. Mandarin is generally classified as a high-context language (or, perhaps, high-context-dependence) whereas English is generally classified as a low-context language (low-context-dependence). What this means is that if I were to say something in Mandarin, I could, in theory, say it in far fewer words than its English counterpart, mostly because you already have all the cultural context you need to understand what I’m trying to say.

But this isn’t the only axis by which semantic expressivity can be judged. While Mandarin can be semantically compact, it is probably on the more difficult end of the spectrum of languages to learn and highly dependent on familiarity with Chinese culture. In contrast, English—which is more semantically expansive—makes for a wonderful lingua franca.

How does this apply to programming languages?

Java is a relatively low-context-dependent language. You have to spell out a lot of things. You have to specify type, class, etc., etc. Sure, it’s far fewer things than you would have to spell out in C, or God help you, x86 assembly, but when you compare it to languages like Perl and Ruby, it’s a lot of stuff. This surely contributes to the code bloat issue.
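To make the concept-to-code ratio a bit more concrete, here is a rough sketch of one concept (counting how often each word appears) written twice. It is in Python rather than Java or Ruby, purely so both styles fit in one snippet, and the function names are made up for illustration. The first version spells everything out; the second leans on the reader already knowing what `Counter` does.

```python
from collections import Counter


def word_counts_spelled_out(text):
    """Low-context style: every step is written out explicitly."""
    counts = {}
    words = text.lower().split()
    for word in words:
        if word in counts:
            counts[word] = counts[word] + 1
        else:
            counts[word] = 1
    return counts


def word_counts_compact(text):
    """High-context style: assumes the reader knows what Counter does."""
    return dict(Counter(text.lower().split()))


sample = "the cat saw the other cat"
spelled_out = word_counts_spelled_out(sample)
compact = word_counts_compact(sample)
```

Both functions return the same dictionary. The second is shorter only because more of the meaning lives in shared context (the standard library) instead of on the page.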

Java’s low-context-dependence is probably what makes it such a widely used language, though.

Ruby, however, seems to be a high-context-dependent language. I say this mostly because of my exposure to Rails. Code targeted to Rails is compact and semantically dense but you may have no idea how things are being implemented. The details have been abstractified and hidden, and it is well known that you can write a fully-functional Rails app without really knowing much Ruby at all.

But the difference between Ruby and Mandarin Chinese is that I feel that Ruby is far more transparent and easy to understand for the non-native.

A little personal background: I got my first computer, a Commodore 64, when I was 8 years old. I learned how to program in BASIC, 6502 Assembly, and Logo (which, if you strip away the turtle graphics, is apparently highly reminiscent of Lisp). I got my first x86 machine when I was 13, and learned how to program in Pascal. (When I was in high school, the Advanced Placement Computer Science course was based on Pascal. I never actually took the course but managed to get a 4 out of 5 on the AP test.) In college I dabbled a little in C and C++, and in grad school I learned Perl.

In all that time, I never really wrote a complete app, unless you count the extraordinarily rudimentary patient database system I wrote in Turbo Pascal 5.0 for my dad, or the hacked-together Visual Basic program I set up to help me keep track of medical billing when I worked for a solo family practitioner. The most complicated thing I ever wrote in Perl is a CGI script that transliterates Roman characters to Alibata.

As you can see, I’m not a professional coder.

But I am an enthusiastic hobbyist linguist. And I’m more than a little intrigued by artificial intelligence. One of the things that I used to try to implement (in BASIC, of all languages!) was a programming language based on natural language. At the time, speech-to-text seemed to be an impossible, highly futuristic idea, but I figured that if anyone ever worked out how to turn speech into text, they’d still need an engine to parse it into actual commands. (These were the ideas I would get whenever I would hear Jean-Luc Picard ask the ship’s computer to do something.)

In retrospect, this was probably a little over the head of the average 10-year-old, and at the time, I definitely couldn’t figure out how to get from here to there.

What I discovered, instead, were text adventure games.

The prototype for this genre is Colossal Cave Adventure, or more commonly, just Adventure. In modern parlance, such games are sometimes known as interactive fiction. The variant/descendant that I was first exposed to was Zork. This was my first encounter with Infocom and their pioneering text parsing engine.

For the longest time, I had sought to implement my own text parsing engine, but to no avail. Interestingly, of all the languages that I had access to at the time, it seemed most intuitive to implement a text parsing engine in Logo.

But back to the modern era of computer programming.

One can look at the divide between C/C++/Java/C# and Perl/Python/Ruby/JavaScript as merely a generational gap. The old school programmers tend to use the former. The newbies use the latter.

But I think there is another distinction to be made: the former group of languages tends to be compiled down into machine code or byte code fairly straightforwardly. The higher-level concepts, procedures, functions, objects, etc. are all representations of actual low-level entities. What the former group of languages allows one to do is control the machine with almost as much precision as one could by writing in straight-up machine language. The higher level language exists mostly so that a coder can actually read what’s going on, and so that they don’t have to look at the minutiae involved with saving registers and stacks and what-not every time they call a procedure.

The syntax and the implementation are tightly coupled: simple high-level concepts get translated to simple machine code and complex high-level concepts get translated to complex machine code. Sometimes this means that making the code easy to read and easy to maintain is diametrically opposed to keeping performance bearable and keeping memory usage sane.

The latter group of languages all started out as scripting languages, and it used to be that the main purpose of a scripting language was to automate various higher level tasks. I would hazard to guess that Larry Wall didn’t really intend people to build huge, monstrous piles of code to completely power multimillion dollar enterprises almost entirely in Perl, but what’s done is done, and here we are.

Part of the rationale for something like Perl is to make the invocation of certain tasks easier, in the sense that you can use somewhat more natural language to make a computer do something. Hence, Perl’s mantra of TMTOWTDI: There’s more than one way to do it. The syntax of the command may have no bearing whatsoever on how the actual task is implemented. A single line of Perl can get expanded into a humongous pile of complicated machine code, but the average coder need not know anything about the underlying complexity. They can just get stuff done. In the era of 640K of RAM and 80 MB hard drives, this was not all that tenable. Coders needed to count every single byte and every single processor cycle.

But now we have machines that have gigabytes of RAM and terabytes of hard drive space, so this type of obsessive-compulsive fastidiousness is not as necessary. We can let the computer do even more of the grunt-work.

So the most intriguing part of modern languages like Ruby is the idea of self-reflection. This really gets the AI otaku in me excited. One way to look at it less sensationally, though, is to think of self-reflection as a way to make invoking certain tasks more like asking for things in natural language.
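As a rough illustration of what this buys you, here is a small Python sketch (Python standing in for Ruby; `Record` and `Collection` are hypothetical names, not part of any real framework). The collection never defines a `find_by_city` method. It inspects the name of the method you asked for at runtime and works out what you meant, loosely in the spirit of Rails-style dynamic finders.

```python
class Record:
    """Hypothetical record type: just a bag of named fields."""

    def __init__(self, **fields):
        self.__dict__.update(fields)


class Collection:
    """Hypothetical collection that answers find_by_<field>(value) calls."""

    def __init__(self, records):
        self.records = list(records)

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal attribute lookup fails,
        # so this is where we interpret method names nobody wrote down.
        prefix = "find_by_"
        if name.startswith(prefix):
            field = name[len(prefix):]

            def finder(value):
                return [r for r in self.records
                        if getattr(r, field, None) == value]

            return finder
        raise AttributeError(name)


people = Collection([
    Record(name="Ada", city="London"),
    Record(name="Grace", city="New York"),
])
londoners = people.find_by_city("London")
```

The request reads almost like a phrase ("find by city") and the object figures out the rest from the name itself, which is the sense in which this feels closer to asking than to instructing.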

Rails is perhaps the prototype in this regard. While MVC paradigms have existed for a while, the one thing that makes Rails stand out is the idea of convention over configuration. Configuration is the old school way to do things: control every part of the system and tell the computer exactly how to do it, just shy of coding in pure machine language. Convention takes advantage of sociolinguistic context.
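Here is a minimal sketch of the difference, again in Python with made-up names (`Model`, `default_table_name`, `BlogPost`, and `Person` are all hypothetical). It assumes a deliberately naive convention: a class name maps to a snake_case table name with an "s" stuck on the end. The real Rails inflector is far more elaborate than this.

```python
import re


def default_table_name(class_name):
    """Naive convention: CamelCase class name -> snake_case plus 's'."""
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()
    return snake + "s"


class Model:
    # Configuration: a subclass may set this to override the convention.
    table_name = None

    @classmethod
    def table(cls):
        # Convention: if nothing was configured, derive the name from the class.
        return cls.table_name or default_table_name(cls.__name__)


class BlogPost(Model):
    pass  # says nothing about tables; the convention fills it in


class Person(Model):
    table_name = "people"  # the convention would guess wrong, so configure it


blog_table = BlogPost.table()
person_table = Person.table()
```

`BlogPost` gets its table name without saying a word about it, while `Person` has to speak up only because it is the exception. That is the shared context doing the work.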

And with the pairing of convention over configuration (sociolinguistic context) with self-reflection, you’ve just come that much closer to implementing AI.
