Some Japanese researchers feel that AI systems trained on foreign languages cannot grasp the intricacies of Japanese language and culture.
Japan is building its own versions of ChatGPT — the artificial intelligence (AI) chatbot made by US firm OpenAI that became a worldwide sensation after it was unveiled just under a year ago.
The Japanese government and big technology firms such as NEC, Fujitsu and SoftBank are sinking hundreds of millions of dollars into creating AI systems that are based on the same underlying technology, known as large language models (LLMs), but that use the Japanese language, rather than translations of the English version.
“Current public LLMs, such as GPT, excel in English, but often fall short in Japanese due to differences in the alphabet system, limited data and other factors,” says Keisuke Sakaguchi, a researcher at Tohoku University in Japan who specializes in natural language processing.
English bias
LLMs typically use huge amounts of data from publicly available sources to learn the patterns of natural speech and prose. They are trained to predict the next word on the basis of previous words in a piece of text. The vast majority of the text that ChatGPT’s previous model, GPT-3, was trained on was in English.
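As a rough illustration of what "predicting the next word on the basis of previous words" means, here is a minimal sketch using a toy bigram counter. This is not how GPT-3 or any production LLM is built; real models use neural networks conditioned on long contexts. The function names and the tiny corpus are hypothetical and chosen only for this example.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical toy corpus; a real model would see billions of words.
corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigram(corpus)

print(predict_next(model, "the"))  # "cat" follows "the" most often here
print(predict_next(model, "dog"))  # None: the word never appeared in training
```

The second call hints at the data problem described in this section: a model can only predict well for text that resembles what it was trained on, so a corpus dominated by English leaves it with far less to go on in Japanese.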
ChatGPT’s eerie ability to hold human-like conversations has both delighted and concerned researchers. Some see it as a potential labour-saving tool; others worry that it could be used to fabricate scientific papers or data.
In Japan, there’s a concern that AI systems trained on data sets in other languages cannot grasp the intricacies of Japan’s language and culture. The structure of sentences in Japanese is completely different from English. ChatGPT must therefore translate a Japanese query into English, find the answer and then translate the response back into Japanese.
Whereas English has just 26 letters, written Japanese consists of two sets of 48 basic characters, plus 2,136 regularly used Chinese characters, or kanji. Most kanji have two or more pronunciations, and a further 50,000 or so rarely used kanji exist. Given that complexity, it is not surprising that ChatGPT can stumble with the language.
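To put those figures side by side, the short calculation below tallies the character counts exactly as quoted above. The constant and function names are hypothetical, and the 50,000 figure is the approximate one given in the text.

```python
# Figures as quoted in the paragraph above.
ENGLISH_LETTERS = 26
KANA_SETS = 2               # the two basic sets, hiragana and katakana
KANA_PER_SET = 48
REGULAR_KANJI = 2136
RARE_KANJI_APPROX = 50_000  # "50,000 or so": an approximation, not an exact count


def japanese_inventory(include_rare=False):
    """Total number of distinct written characters, per the quoted figures."""
    total = KANA_SETS * KANA_PER_SET + REGULAR_KANJI
    if include_rare:
        total += RARE_KANJI_APPROX
    return total


common = japanese_inventory()
print(common)                                # 2232 characters in regular use
print(round(common / ENGLISH_LETTERS, 1))    # 85.8 times the English alphabet
print(japanese_inventory(include_rare=True)) # 52232 including rarely used kanji
```

On these numbers, even the regularly used characters outnumber the English alphabet by a factor of roughly 86.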
In Japanese, ChatGPT “sometimes generates extremely rare characters that most people have never seen before, and weird unknown words result”, says Sakaguchi.
Cultural norms
For an LLM to be useful and even commercially viable, it needs to accurately reflect cultural practices as well as language. If ChatGPT is prompted to write a job-application e-mail in Japanese, for instance, it might omit standard expressions of politeness, and look like an obvious translation from English.
To gauge how sensitive LLMs are to Japanese culture, a group of researchers launched Rakuda, a ranking of how well LLMs can answer open-ended questions on Japanese topics. Rakuda co-founder Sam Passaglia and his colleagues asked ChatGPT to compare the fluidity and cultural appropriateness of answers to standard prompts. Their use of the tool to rank the results was based on a preprint published in June that showed that GPT-4 agrees with human reviewers 87% of the time (ref. 1). The best open-source Japanese LLM ranks fourth on Rakuda, while in first place, perhaps unsurprisingly given that it is also the judge of the competition, is GPT-4.
“Certainly Japanese LLMs are getting much better, but they are far behind GPT-4,” says Passaglia, a physicist at the University of Tokyo who studies Japanese language models. But there is no reason in principle, he says, that a Japanese LLM couldn’t equal or surpass GPT-4 in future. “This is not technically insurmountable, but just a question of resources.”
One large effort to create a Japanese LLM is using the Japanese supercomputer Fugaku, one of the world’s fastest, to train a model mainly on Japanese-language input. Backed by the Tokyo Institute of Technology, Tohoku University, Fujitsu and the government-funded RIKEN group of research centres, the resulting LLM is expected to be released next year. It will join other open-source LLMs in making its code available to all users, unlike GPT-4 and other proprietary models. According to Sakaguchi, who is involved in the project, the team hopes to give it at least 30 billion parameters, which are values that influence its output and can serve as a yardstick for its size.
However, the Fugaku LLM might be succeeded by an even larger one. Japan’s Ministry of Education, Culture, Sports, Science and Technology is funding the creation of a Japanese AI program tuned to scientific needs that will generate scientific hypotheses by learning from published research, speeding up identification of targets for enquiry. The model could start off at 100 billion parameters, which would be just over half the size of GPT-3, and would be expanded over time.
“We hope to dramatically accelerate the scientific research cycle and expand the search space,” Makoto Taiji, deputy director at RIKEN Center for Biosystems Dynamics Research, says of the project. The LLM could cost at least ¥30 billion (US$204 million) to develop and is expected to be publicly released in 2031.
Expanding capabilities
Other Japanese companies are already commercializing, or planning to commercialize, their own LLM technologies. Supercomputer maker NEC began using its generative AI based on Japanese language in May, and claims it reduces the time required to create internal reports by 50% and internal software source code by 80%. In July, the company began offering customizable generative AI services to customers.
Masafumi Oyamada, senior principal researcher at NEC Data Science Laboratories, says that it can be used “in a wide range of industries, such as finance, transportation and logistics, distribution and manufacturing”. He adds that researchers could put it to work writing code, helping to write and edit papers and surveying existing published papers, among other tasks.
Japanese telecommunications firm SoftBank, meanwhile, is investing some ¥20 billion into generative AI trained on Japanese text and plans to launch its own LLM next year. SoftBank, which has 40 million customers and a partnership with OpenAI investor Microsoft, says it aims to help companies digitize their businesses and increase productivity. SoftBank expects that its LLM will be used by universities, research institutions and other organizations.
Meanwhile, Japanese researchers hope that a precise, effective and made-in-Japan AI chatbot could help to accelerate science and bridge the gap between Japan and the rest of the world.
“If a Japanese version of ChatGPT can be made accurate, it is expected to bring better results for people who want to learn Japanese or conduct research on Japan,” says Shotaro Kinoshita, a researcher in medical technology at the Keio University School of Medicine in Tokyo. “As a result, there may be a positive impact on international joint research.”
doi: https://doi.org/10.1038/d41586-023-02868-z