Chapter 1. Introduction - The Basics

Table of Contents

Text Handling - Inside a Computer
ASCII
Unicode
Fonts (for computers)
The special features of Indic scripts

In this introductory chapter, I will explain (in brief) how a computer handles characters. This chapter deals with the concepts behind text encoding, codepages and the need for a standardised text encoding system. However, please do always keep in mind that this is a very superficial treatment of the subject, just enough to get you started with minimal confusion. If you are interested in getting deeper into the subject, please look at the links I have provided at the end of this document.

Text Handling - Inside a Computer

Computers are stupid. They do not understand anything other than numbers. And when I say numbers, these are not the numbers that you and I use every day. They only recognise numbers which look like 110010101, 10010111, etc. This is not very surprising, since computers were originally designed to process (or, as we techies sometimes say, crunch) calculations involving extremely large numbers, which no sane human being was willing to do (always remember - laziness is the mother of invention). The earliest computers had very little to do with text, and as a result, even today, the computer understands nothing but numbers. And since the components of a computer ran on electricity, it was easier to handle information in terms of "On" and "Off" states: "On" represented "1" and "Off" represented "0". So internally, computers handled numbers as sequences of 1s and 0s, using the binary number system. Such a unit of information, which can be in either the state "0" or "1", is called a bit.
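
If you want to see this for yourself, here is a minimal sketch in Python (Python is purely my choice for illustration here - nothing in this guide depends on it):

    # A number is stored as a pattern of bits; Python can show that pattern.
    n = 202
    print(bin(n))      # '0b11001010' - the binary form of 202
    # Eight bits (one byte) can hold 2**8 = 256 different values: 0 to 255.
    print(2 ** 8)      # 256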

By now, you must be thinking - but how on earth do I do all that word processing and emailing on my computer?? Well, as the computing age advanced, computer scientists devised a very clever way to handle text. They simply assigned a unique number to each and every letter of their alphabet (which was - no prizes for guessing this correctly - Latin). So after this, when you told the computer - "Hey, I am giving you some textual data", and you entered the number, say 65, the computer stored it internally as 65, but displayed it to you as the character "A". Similarly, 66 stood for "B", 67 for "C" and so on. Moreover, there were some special characters called "control characters" - carriage return, tab, line feed, etc. - which instructed the computer to make visual adjustments to the displayed text. This entire system of character to number mapping was called an encoding.
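
To make the idea concrete, here is a minimal Python sketch (an illustration of mine, not how the original machines actually did it - those mappings were wired into hardware and operating systems):

    # chr() turns a number into the character it is mapped to,
    # and ord() goes the other way round.
    print(chr(65))     # 'A' - the number 65 is displayed as the letter A
    print(ord('B'))    # 66  - the letter B is stored as the number 66
    # A piece of "text" is therefore just a sequence of numbers.
    print([ord(c) for c in "CAB"])   # [67, 65, 66]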

ASCII

At first, there was no standardised way of assigning numbers to the characters, and almost every computer came with its own system of character to number mapping. However, when people started transferring data between computers, they realised that they had to have a standardised way of representing characters, and thus they came up with a system which mapped the basic Latin characters, punctuation and control characters to the numbers ranging from 0-127. At the time this system was introduced, computers used groups of eight bits as their smallest unit of information. However, the eighth bit was usually used for error checking, and hence, people could use only the first seven bits for any useful purpose. The maximum number of unique things that you can represent with seven bits (each of which could be either zero or one) is 2^7, i.e. 128, and hence we come across the 128-character limit. This was called ASCII, which is an acronym for "American Standard Code for Information Interchange". ASCII was formally introduced as a standard in 1963, and is probably one of the most successful software standards ever released. Another quite successful (at that time) encoding standard was EBCDIC, which stands for Extended Binary Coded Decimal Interchange Code. This was brought forward by IBM, and was mainly used in IBM mainframes.
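
Again purely as an illustration (assuming a Python interpreter is at hand), you can check the 7-bit arithmetic and the resulting mapping yourself:

    # Seven bits give 2**7 = 128 possible values, hence codes 0-127.
    print(2 ** 7)                      # 128
    # Encoding a string as ASCII yields one byte (number) per character,
    # and every byte stays within the 0-127 range.
    data = "Hello!".encode("ascii")
    print(list(data))                  # [72, 101, 108, 108, 111, 33]
    # A character outside that range simply cannot be expressed in ASCII:
    # "é".encode("ascii") would raise a UnicodeEncodeError.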

However, ASCII (and EBCDIC) could only handle the basic Latin characters (and maybe a few more for German and French) - i.e. you could only handle plain and simple English (and a few other west European languages) with it. But by this time computers had spread all over the world, and people wanted to use their systems in their own language. So they extended the 128 character limit of ASCII to 256, and used the upper 128 numbers (128-255) to represent the extra characters that they needed. This was the general practice in European countries, while countries in East Asia started to use even fancier methods for their encoding (since their character count ran into the thousands). However, the net result of all this was that every country had a different codepage of its own, and computers used in those countries usually shipped with only that codepage enabled. In a nutshell, that meant that you couldn't read a Hebrew text file on a computer bought in Greece.
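
Here is a minimal sketch of the problem, using the codepage tables that ship with Python (the Windows codepages cp1253 for Greek and cp1255 for Hebrew, chosen purely as examples) - the very same byte means completely different things depending on which codepage the machine happens to use:

    # One and the same byte value, 0xE1, interpreted under two codepages:
    raw = bytes([0xE1])
    print(raw.decode("cp1253"))   # 'α' - Greek small letter alpha
    print(raw.decode("cp1255"))   # 'ב' - Hebrew letter bet
    # A file written on one machine therefore turns into gibberish
    # when it is opened on a machine set up for the other codepage.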

Unicode

To resolve this confusion - in the late eighties, some people decided to have a single Grand Unified standard called... (drums and trumpets please) Unicode. The broad (and very ambitious) aim of Unicode is to assign a unique number to each and every character of each and every "reasonable" writing system used on the earth, “no matter what the platform, no matter what the program, no matter what the language”. Apart from characters, Unicode also assigns numbers to various symbols and signs, like the various mathematical signs, the symbols used in musical notation, etc. However, before you jump to conclusions, let me state very clearly that Unicode is not a "16-bit" encoding system, with each character taking up sixteen bits of storage space. Unicode simply assigns numbers to characters - and how those numbers are represented inside computers is a completely different story (where you would come across weird sounding technologies called UTF-8, UTF-16 and UCS-4). As a font developer, you need not worry much about methods of representing Unicode - but if you are interested, you can look at this article by Joel Spolsky, or if you are of the more techie kind, you may want to read this article at the IBM website.
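
To see the difference between the number Unicode assigns and the bytes an encoding produces, here is a minimal Python sketch (purely illustrative, using the Devanagari letter अ as the example):

    ch = "अ"                         # DEVANAGARI LETTER A
    # Unicode assigns this character the number (codepoint) U+0905 ...
    print(hex(ord(ch)))              # '0x905'
    # ... but different encodings store that same number as different bytes.
    print(ch.encode("utf-8"))        # b'\xe0\xa4\x85'  (three bytes)
    print(ch.encode("utf-16-le"))    # b'\x05\t'        (two bytes: 0x05 0x09)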

At the time of writing, the latest version of the Unicode standard (version 4.1) can handle the following Indic scripts (note that I mention script here - not language. Always remember that Unicode encodes scripts, not languages):

  • Devanagari

  • Bengali

  • Gurmukhi

  • Gujarati

  • Oriya

  • Tamil

  • Telugu

  • Kannada

  • Malayalam

  • Sinhala

  • Thai

  • Lao

  • Tibetan

  • Myanmar

Other scripts such as Ol Chiki (used for Santali) and Syloti Nagri (used in the Sylhet region of Bangladesh) are in the pipeline and may be added in the next major revision of Unicode.

However, if you go through the code charts for the various scripts, you will notice that Unicode only assigns numbers to the base characters of a particular script. That means that displaying conjuncts (sanyuktakshars/yuktakshars) and other combined characters is something which the relevant software on the computer and the font have to do. Though this may seem weird at first glance, in reality it adds a lot of flexibility and freedom for the software developer as well as the typographer. Of course, there is a bit more complexity as far as the software and the font are concerned, but the end product is vastly superior to what people have been using for all these years.
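
For example (a minimal Python sketch, using the Devanagari conjunct क्ष purely as an illustration), the conjunct is stored as a plain sequence of base characters plus the virama (halant); producing the single combined shape is the job of the shaping software and the font:

    text = "क्ष"                          # the conjunct KSHA as it is stored
    # Unicode stores it as three codepoints: KA + VIRAMA + SSA.
    print([hex(ord(c)) for c in text])   # ['0x915', '0x94d', '0x937']
    # The shaping engine and the font's OpenType rules decide whether
    # this sequence is actually drawn as a single conjunct glyph.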