While imprisoned for being a “reactionary,” physicist and engineer Zhi Bingyi began devising a system to help computing machines read Chinese characters.
This story is adapted from Kingdom of Characters: The Language Revolution That Made China Modern, by Jing Tsu.
It was 1968, two years into the Cultural Revolution. Shanghai was in the middle of an unseasonal heat wave, and its people cursed the “autumn tiger.” Zhi Bingyi had more to worry about than the heat. He had been branded a “reactionary academic authority,” one of the many damning allegations that sent millions of people to their deaths or to labor camps during the Cultural Revolution. Was it still appropriate for Zhi to think of himself as one of the people? Hadn’t he betrayed them, as he’d been told?
Just four years earlier, Zhi had gone to work every day as director of the newly established Shanghai Municipal Electric Instrument and Research Office under the government’s First Ministry of Machinery Industry. It was one of the most secure jobs one could have. The First Ministry was in charge of building heavy industrial machines in the early period of New China, and later split off a Fourth Ministry to oversee electronic communications technology. Zhi’s specialty was electric metering, with a focus on precision meters and electronic modeling by enhancing the performance of a device’s various parts.
Quiet, cautious, and insistent, Zhi was also highly qualified. He earned a PhD in physics from Leipzig University but declined a job offer in the United States in order to return to China. He taught at two Chinese universities and later helped to devise China’s landmark 12-year Plan for the Development of Science and Technology of 1956. It was a hopeful time for scientists and technicians who were deemed useful for their contributing roles in a state-guided socialist economy.
Since his arrest in July 1968 for being a “reactionary academic authority,” Zhi had been cut off from his research, the news, and his devoted German wife. He was used to working on equations and engineering problems with teams of colleagues. No longer. His only company was the eight characters on the wall of his cell reminding him that prisoners faced two options from their minders: “Leniency to those who confess, severity to those who refuse.”
The purge of the intellectual class had just begun, and anyone who was educated had to bow to the tenets of class struggle and the will of the Gang of Four—the radical contingent of the Chinese Communist Party. Many were sent to the countryside to be reformed through backbreaking labor, picking through manure and tilling fallow fields in the heat and rain with little to eat. They were held to the strictest military discipline in camps that doubled as “reeducation” centers. So successful was Mao’s anti-intellectual campaign that it inspired Pol Pot to launch a similar crusade in Cambodia between 1975 and 1979, killing anyone who wore eyeglasses—incriminating evidence of bourgeois intellectualism.
In the cowshed, Zhi stared at the eight characters on the wall. One day, he no longer saw the ominous message but instead the strokes and characters of which it was composed. He began to notice where the ink thickened, blotched, or trailed off at the ends of each character. Every stroke appeared to him anew, each an enigma with a fresh riddle. Though they were created by a human hand, he realized, each character was essentially repeating combinations of the same abstract strokes and dots.
How would one translate and turn these human-made brushstrokes into a coded language that could be entered into computing machines? It was not the first time someone had thought of rendering Chinese characters systematically into codes, of course. The same question had crossed Count d’Escayrac’s mind more than a century earlier in another prison—the urine-soaked cell of imperial Beijing. And coded language was fiercely defended as a question of national sovereignty in the marble halls of Paris in 1925 and attempted as telegraphic encryption.
But it never would have occurred to any of them to come up with a solution for a machine. Every solution of theirs had been oriented toward the human user—how to organize characters so they are easier for people to write and to learn, less taxing and time-consuming to memorize or look up.
The question in Zhi’s mind burned to a different purpose: How could one render Chinese in a language that computers can read—in the zeros and ones of binary code? Having been used to building computer models of his electrical devices, he would have come across the problem many times.
To catch up with the state of technology in the advanced world in the 1970s, China had begun to build machines that could handle mass-scale calculations, sieve through huge amounts of information, and coordinate complex operations. The data for calculating and controlling flight paths, military targets, and geographical positioning, or tracking agricultural and industrial output, had to be collected first. Yet all the existing records, documents, and reports were in Chinese. It became clear that in order to be part of the computing age at all, the Chinese script would have to be rendered digitally. Western computing technology was also moving in the direction of text processing and communication, not just running large-scale calculations. Converting human language scripts into digital form was the next frontier. The arms race during the Cold War was advancing the state of computing technology in both the Soviet Union and the United States. Getting Chinese inside the machine was critical to ensuring that China was not left out.
Requiring precise inputs, computing machines are unforgiving of inconsistencies and exceptions. All the characteristics of Chinese that stymied earlier innovators—the unwieldy size of its character inventory; its complex strokes, tones, and homophones; the difficulty of segmentation—created new challenges in the digitization of the script. Executable commands could only be in the form of a yes or a no, an on-or-off switch of an electric current running through the circuitry of a computer control board. No partial solutions or patches would help China get by, this time. During Zhi’s incarceration, China was in the throes of its biggest social and political upheaval yet and hardly had the resources to make such a bid for the future.
But for a country so far behind the Western world, science and technology were not just a barrier. They were viewed as essential for helping China leapfrog out of backwardness and speed up the process of modernization. The challenge was multifaceted: to devise a code for Chinese that is easy for humans to remember and use and that can be entered into a machine via punched tape or keyboard; to find a way for the machine to store the massive amount of information required to identify and reproduce Chinese characters; and to be able to retrieve and restore the script with pinpoint precision, on paper or on a screen.
Zhi knew he could tackle the first, critical step: how best to input Chinese into the machine. That meant figuring out a way to represent each character in a language that the human operator and the machine could both understand: as a finite set of zeros and ones entered directly into the machine, or in the alphabetic letters on which computer programming languages were already built. The latter seemed more promising. Mapping characters onto the alphabet immediately led to other questions, however: How many alphabet letters would it take to uniquely encode a single character? Should the spelling of characters be abbreviated like acronyms? And what should serve as the basis of the acronyms: characters, components, or strokes?
Zhi needed a pen and paper to test each hypothesis, but the guards did not even give him toilet paper, let alone something to write on. He looked around and saw the only viable object in the room—a teacup. With that modest vessel of worship, Zhi began his own personal pilgrimage. Each day, with a stolen pen, he inscribed as many characters as he could onto the matte ceramic teacup’s lid, testing out each character with a set of possible Roman letters, then wiped it clean. He squeezed dozens of characters at a time onto the curved surface, relying on memory to keep track of his incremental efforts.
He aimed for every character to have some kind of intuitive but unique relationship to the alphabetic code representing it. There were two known ways of doing so, by sound or shape. Zhi’s predecessors preferred shape-based analysis, taking strokes and components and rearranging them into classifiable categories, but the adoption of the Romanization system of pinyin had made the phonetic approach the national and international language standardization policy. While pinyin solved the problem of phonetic standardization, it did not make the old problems go away. For one thing, it made the issue of homophones worse because so many characters were now spelled identically in alphabetic form. There were only so many ways to spell the pronunciations of different characters with the alphabet’s 26 letters, and they ran out more quickly than the thousands of individually distinct characters. Zhi decided to utilize the best of phonetic Romanization and shape-based cues to make his own encoding process as predictable and logical as possible. The idea was not destined to rot in jail.
In September 1969, after 14 months, Zhi was released and assigned to lowly positions as part of his rehabilitation: sweeping floors, shaping tools in a factory, standing guard at a warehouse. He found it a blessing to be a nobody and went right back to his encoding scheme. He used the warehouse as his study to stash the foreign journal articles and newspapers he had scavenged. He was excited to learn that Japan had been making progress on resolving the problem. Much like what had been done with Chinese typewriters, Japanese engineers were using radical parts of characters to locate, retrieve, and print them on the computer screen. But the Japanese keyboard included more than 3,600 characters, each taking up one key, which was impractical. A company in Australia was also using the radical system to retrieve characters. Using a more modest keyboard of 33 keys, they were able to access close to 200 characters at any time with the stroke of one key, which was an improvement over the Japanese design, but still not enough characters for the Chinese. Then there was the United States, where experimental models were using 44 keys, and, as Zhi would later learn, an even more ambitious project was underway to computerize Chinese printing at the Graphic Arts Research Foundation in Massachusetts. Scholars in Taiwan, meanwhile, were developing their own input systems for traditional characters.
Zhi felt greatly encouraged. His solitary work was running parallel to these larger efforts. Most of them, though, still had not been able to free themselves from clunky keyboards. While breaking down characters into components had worked well enough for specific character retrieval indexes and typewriter keyboard designs, it did not translate directly into programming such a process for a computing machine.
Zhi remembered the advantage of the shape-based approach, where character parts helped to identify the whole character directly. To integrate that useful principle into his encoding scheme, Zhi decided to index characters by their components—the simpler characters within each ideograph—using the first letter of each component’s pinyin spelling.
The idea took another two years to flesh out. On average, characters can be broken into two to four components, and there are 300 to 400 components in total. The majority of characters can be divided into two halves—vertical or horizontal—along with other possible geometries. This yielded a two-to-four-letter alphabetic code for each character, which meant each character required at most four keystrokes on a conventional English keyboard. The average English word length, by comparison, is close to 4.8 letters. Zhi thus made the alphabet work more efficiently for individual ideographs than it did for English. The system also cleverly worked around the problem of dialect difference and homophones. Because the code took only the first letter, rather than the complete sound of the character, most regional speech variations did not matter. The four-letter code worked like an acronym of the different parts of the character. Zhi essentially used the alphabet as a proxy to spell by components rather than words.
He sequenced each character’s components in the order they would have been written by hand. Coding by components gave context and important cues that reduced ambiguity and the risk of duplicated codes. The chances of having the same components—or even components starting with the same letter—occur in the exact same order in two different characters are low.
Zhi’s way of indexing the Chinese character by its alphabetized components made it easier for humans to input Chinese, as long as they knew how to write the language, and created a more systematic human-machine interface. For instance, in his system, the character for “road,” 路 (lu), which has 13 strokes by hand, can be broken up into a mere four components: 口 (kou), 止 (zhi), 攵 (pu), and 口 (kou). Isolating the first letter of each component gives the character code of KZPK. Or take the character 吴 (wu), a common last name, which can be quickly decomposed into two parts, 口 (kou) and 天 (tian), yielding a character code of KT.
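The component-to-letter step can be sketched in a few lines of Python. This is only an illustration of the idea as described here, not Zhi's actual implementation: the `COMPONENTS` table and the function name `on_sight_code` are hypothetical, and the table holds just the two decompositions quoted above.

```python
# Hypothetical lookup table: each character maps to the pinyin of its
# components, listed in handwriting order. Only the two examples from
# the text are included; a real table would cover thousands of characters.
COMPONENTS = {
    "路": ["kou", "zhi", "pu", "kou"],
    "吴": ["kou", "tian"],
}


def on_sight_code(char: str) -> str:
    """Build a code from the first letter of each component's pinyin."""
    return "".join(pinyin[0].upper() for pinyin in COMPONENTS[char])


print(on_sight_code("路"))  # KZPK
print(on_sight_code("吴"))  # KT
```

Because only initials are kept, the code stays at two to four letters no matter how many strokes the character has.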
Alphabetic spelling, once mediated by Chinese in this way, is no longer a phonetic but a semantic spelling system, where each letter actually stands for a character rather than a sound. This method of indexing can also be extended to represent groups of characters. Take, for instance, “socialism,” or shehui zhuyi: 社会主义. By tagging the first letter of each of the four characters in the phrase, the phrase can be coded in a four-letter sequence, SHZY. Or consider another frequently invoked phrase, the seven characters that make up “People’s Republic of China”—Zhonghua renmin gongheguo: 中华人民共和国. It can simply be typed in as ZHRMGHG.
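The phrase-level shorthand follows the same pattern, one letter per character. A minimal Python sketch, assuming the pinyin syllable of each character is already known (the function name `phrase_code` is hypothetical):

```python
def phrase_code(syllables: list[str]) -> str:
    """Take the first letter of each character's pinyin syllable."""
    return "".join(s[0].upper() for s in syllables)


# 社会主义 (shehui zhuyi), one syllable per character
print(phrase_code(["she", "hui", "zhu", "yi"]))  # SHZY

# 中华人民共和国 (Zhonghua renmin gongheguo)
print(phrase_code(["zhong", "hua", "ren", "min", "gong", "he", "guo"]))  # ZHRMGHG
```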
Zhi’s coding system could also include properties that are not strictly phonetic. Additional letters could add the pronunciation of the whole character or its shape pattern to the basic four-letter component-based code. The character 路 has the phonetic pronunciation of “lu” and, because it can be divided into two vertical halves, has a zuo you (left-right) structure. Both features can be indicated in the extended code KZPKLZ. The more precise you can be about encoding the information of a character, the more useful that code can be. These extensions of Zhi’s system would be important for Chinese-language applications in machine translation and retrieving information from stored data.
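One way to picture the extended code is as the component code plus two suffix letters. The Python sketch below assumes, based only on the 路 example, that the suffix is the initial of the character's pronunciation followed by the initial of its structure name; the function `extended_code` and its parameters are hypothetical.

```python
def extended_code(base: str, pronunciation: str, structure: str) -> str:
    """Append pronunciation and structure initials to a component code.

    Assumption for illustration: one suffix letter for each feature,
    in that order, as in the 路 example from the text.
    """
    return base + pronunciation[0].upper() + structure[0].upper()


# 路: component code KZPK, pronounced "lu", left-right ("zuo you") structure
print(extended_code("KZPK", "lu", "zuo you"))  # KZPKLZ
```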
Zhi formally introduced his “On-Sight” encoding system in the Chinese science journal Nature Magazine in 1978. He described his system as topological, extrapolated from the geometry of parts. With four-letter codes using all 26 letters of the alphabet, there were enough combinations to generate 456,976 possible unique codes. Zhi claimed for his system an efficiency similar to that of Morse code: quick, intuitive, and transparent.
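The size of that code space is simple arithmetic, and a short Python check makes it concrete. The second figure, which also counts the shorter two- and three-letter codes, is an added illustration and not a number from the article.

```python
LETTERS = 26

# Codes of exactly four letters: 26 choices at each of four positions.
four_letter_codes = LETTERS ** 4
print(four_letter_codes)  # 456976

# Illustration only: codes of length two, three, or four combined.
two_to_four = sum(LETTERS ** n for n in range(2, 5))
print(two_to_four)  # 475228
```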
News of Zhi’s feat spread, galvanized by the political fervor for science and technology that broke out after Mao’s death in 1976. On the front page of Shanghai’s Wenhui Daily, on July 19, 1978, the editor euphorically announced, “The Chinese Script Has Entered the Computing Machine.”
Computers could finally “understand” square-shape characters. After more than a decade of isolation, China could at last have a shot at communicating with the world and managing its own flow of information digitally.
From Kingdom of Characters: The Language Revolution That Made China Modern by Jing Tsu, published by Riverhead, an imprint of Penguin Publishing Group, a division of Penguin Random House, LLC. Copyright (c) 2022 by Jing Tsu.