Breaking AI impasse


How do we reconcile the pressing needs for tech advancement with the time-tested rules for IP protection? Wang Sijia, director of IP and data compliance at NetEase Group, shares her views

AFTER ABSORBING an unfathomable amount of data for machine learning, generative artificial intelligence (AI) – most prominently OpenAI’s ChatGPT – now wields a wealth of functions ranging from writing articles and generating images to summarising abstracts and enhancing pictures. This has not only significantly lowered the threshold for professional creation, but has also fundamentally subverted the underlying logic of traditional content creation, setting off a shockwave in global cultural and creative industries.

At present, what legal issues plague the development of AI technology? What can AI enterprises do to balance tech development and risk control? Tapping into her more than 12 years of IP and compliance experience in the technology sector, Wang Sijia, director of IP and data compliance at NetEase Group, offers her unique insights to China Business Law Journal.

Wang Sijia

CBLJ: What are the main challenges facing the development of AI technology?

Wang Sijia: From the perspective of legal compliance, the development of AI technology faces extremely diverse challenges, spanning IP, data bias and discrimination, personal information and privacy protection, data security and trade secrets. Among these, IP-related issues are without doubt the most critical, as well as the most closely tied to technological development.

Since 2022, a stream of judicial and administrative penalty cases involving AI-generated content (AIGC) has emerged in major jurisdictions around the world. The US has both the largest number of such cases and the most active docket, which is no surprise given its leading role in generative AI technology and its broad application. It is followed by China and the EU, while Japan has seen only a handful.

When we take a closer look at the more high-profile litigation cases related to generative AI in the US, three things stand out. First, the vast majority of them involve IP.

Second, the disputes focus on two types of issues: whether the use of copyrighted works by generative AI for machine learning constitutes infringement; and whether AIGC constitutes a work deserving of copyright protection, with the former being the more prominent. One notable case is The New York Times v OpenAI and Microsoft, filed on 27 December 2023, in which The New York Times expressly requested in its complaint that all training datasets incorporating its copyrighted works be destroyed. This claim took direct aim at the datasets on which machine learning depends, the bread and butter of generative AI technology.

Third, the proportion of class action suits is relatively high. For example, three US artists, Sarah Andersen, Kelly McKernan and Karla Ortiz, filed a class action suit on behalf of others against Stability and Midjourney in January 2023. Additionally, two US authors, Paul Tremblay and Mona Awad, spearheaded a class action suit against OpenAI on 28 June 2023.

Those familiar with US case law will understand that class action suits are often organised and filed by specialised attorneys, who typically adopt a contingency fee model, meaning that if a case is won they may take more than 40% of the award. Facing such an enormous litigation burden and irreversible reputational damage, AI enterprises often pay a king’s ransom to settle disputes.

For AI enterprises, these challenges and difficulties are very real and, in practice, centre on application. In other words, they are governance and compliance issues that arise after the AI technology has been developed and deployed. However, for many AI startups still at the dawn of their technological exploration, with real-world application all but out of reach, the pressing concern is not how to govern AI, but how to build their own proprietary large models.

OpenAI has achieved world-renowned success, and there is nothing whimsical about its formula: simply feed an ocean of data into the learning machine and wait for it to click.

AI technology development is driven by a powerful trio: algorithms, computing power and data, each as indispensable as the next. In the past, we thought that computing power was a bottleneck for AI in China because of its reliance on chips, which, for external reasons, have become hard to come by.

But now it has gradually dawned on us that, for Chinese enterprises hoping for a breakthrough in large models, the greatest impediment is not computing power but the absence of a large-scale, high-quality Chinese-language training dataset necessary for machine learning. After all, computer calculation can be fast or slow, but datasets? You either have one or you don’t.

Yes, Chinese is the oldest language still in use today, carrying 5,000 years of cultural accumulation, but that heritage has never been compiled into usable training data. Compared with the enormous, easily accessible, open-source English datasets extensively tapped by foreign large models, such as Common Crawl (web data), Project Gutenberg (books) and Wikipedia, we fall far behind in both quantity and quality. Moreover, no single Chinese enterprise has the capacity to collect and collate such a huge corpus on its own. So, for the moment, we have no fish in the pond, nor have we learned how to fish.

