The long road to a truly open Artificial Intelligence (AI)

Frederic Dupeux
Chief Information Security Officer
at Banque Havilland



The term “open source” has become fashionable in the field of Artificial Intelligence (AI), championed by major players such as Meta and Elon Musk. However, there is no consensus on the definition of open AI. This ambiguity allows leading companies to bend the concept to their advantage, which could strengthen their dominant position.


The rise of artificial intelligence raises many ethical, legal and conceptual questions within the open source community. While open source has a clear definition, i.e. source code that is accessible, modifiable and redistributable, this is not the case for open AI. Indeed, no agreed definition has been adopted, owing to divergent interests and the greater complexity of AI systems compared with traditional software. Unlike conventional software, AI systems depend on large amounts of data and involve many components, such as training data, pre-processing code and model architecture.

One of the major concerns of the open source community is, quite rightly, intellectual property rights when algorithms are trained on large quantities of data of unknown origin. This uncertainty discourages some developers from sharing their data, which could hinder progress in the field of open source AI. Data has become a real battleground for all the players in the field, since the performance of current models depends directly on the volume of data ingested.

The complexity and lack of transparency of AI make it difficult to understand or rationalise AI decisions based solely on the source code, calling into question the concept of open AI. The generation of text, images, videos or code therefore raises licensing, security, and regulatory issues due to the lack of clarity over their origin.


“While open source has a clear definition, (…), modifiable and redistributable, this is not the case for open AI. Indeed, no agreed definition has been adopted due to divergent interests and the complexity of AI systems compared to traditional software.”



From sharing to plundering

Historically, open source was born from the desire to share and the need for hardware providers to offer software for their machines. Today, this way of working is constantly evolving, encouraging innovation, collaboration and the sharing of knowledge within a diverse community. While software was at the heart of the evolution of computing systems in the early decades, data has driven advances in AI over the last two decades.

Leading AI technology companies have adopted a variety of open source strategies. Some AI models are shared more freely than others. Meta, for example, has released its Llama 2 model as open source, while OpenAI has restricted access to its most powerful models. Google offers freely available Gemma models designed to compete with rival open models. However, many models described as open source come with restrictions on use, contradicting the very principles of open source.

The use of data to create AI is one of the main sticking points. While pre-trained models are often shared, the datasets used to train them are not, limiting the ability to fully modify and study these models. This lack of data transparency is a significant barrier to true openness in AI.

According to Aviya Skowron, head of policy and ethics at the non-profit AI research group EleutherAI, there is a lack of clarity over the use of copyrighted material in the training of AI models. Stefano Zacchiroli, a professor at the Polytechnic Institute of Paris and a key contributor to the Open Source Initiative (OSI) definition process, believes that a full description of the training data is essential for an AI model to be considered open source.

Large companies are reluctant to share training data due to competitive advantages and regulatory concerns. This reluctance undermines the very ethics of open source and can only strengthen the power of large technology companies. According to one published analysis, OpenAI’s GPT-4 model, along with Mistral AI’s Mistral/Mixtral and Meta’s Llama 2 models, reportedly accounts for the most copyright violations. With 44% of its generated content protected by copyright, GPT-4 is by far the model that produces the most exact reproductions of protected content.

AI with a high societal impact is bound to be open

A clear and widely accepted definition of open source AI is urgently needed to prevent these powerful companies from dictating terms that suit their interests.

A truly open AI would have many advantages, such as promoting innovation, transparency, responsibility, fairness and human values – in short, an AI with a high societal and ethical impact. Open AI would make it possible to mitigate the main threats generated by AI, namely its malicious use and the perpetuation of prejudice and discrimination. Open AI would generate significant social and economic progress, particularly in sectors such as healthcare, education and finance.

For example, in the banking sector, the promise of AI is undeniable, particularly for fraud detection systems that better anticipate criminal activity. It is also possible to imagine a new form of customer relations, or personalised financial advice tailored to individual needs. Finally, AI should make it possible to envisage a new form of risk management and crisis forecasting: learning algorithms could rapidly identify the early signs of a crisis so its effects can be better managed.

At a time of growing awareness of the incredible potential of Artificial Intelligence, but also of the threats that accompany it, it is urgent to reflect on responsible and ethical uses of AI. Open source, founded from its beginnings on values of sharing and transparency, can offer a path to AI aligned with our human values. This path will require ongoing collaboration among developers, researchers and regulators to ensure its future.



(source: Agefi)