Open Source AI: To Release or Not To Release Large AI Models?

💙

In February 2019, OpenAI introduced GPT-2, a groundbreaking language model capable of reading comprehension, machine translation, and summarization. Citing concerns about potential misuse, OpenAI initially released only a smaller version of GPT-2. After analyzing the risks and observing no strong evidence of misuse, it gradually released larger versions, culminating in the full GPT-2 release in November 2019. GPT-2 was trained as a large-scale unsupervised language model on 40 GB of text scraped from web pages linked in Reddit posts with a karma score (calculated from user upvotes) of at least 3 (Santa Clara University, n.d.). Its introduction sparked discussion among researchers about the balance between potential societal harms and benefits. AI development is moving rapidly: large AI models have benefited from the extensive data available on the Internet and from fast growth in computational power. Increasingly powerful models, such as ChatGPT, GPT-4, and DALL-E, continue to reach the market. At the same time, various ethical issues associated with these large AI models are progressively coming to light.

This article examines the ethical challenges of publishing a large AI model, focusing on the potential risks and benefits to society. I will explore the "To Release or Not To Release" dilemma that AI developers and researchers face in balancing openness and collaboration against the need to prevent misuse of powerful language models.

💜

Open source refers to a development model in which the source code of a piece of software is made freely available to the public. This allows developers, researchers, and users to access, modify, and distribute the code with few restrictions. Open-source software has gained popularity across many domains, including machine learning and AI. Open-source machine learning models extend this concept: pre-built algorithms and models are shared openly for everyone to use, adapt, and improve. With access to these models, developers can put machine learning to work quickly, reducing the time and resources needed to build models from scratch. Open-source models also let the public try state-of-the-art AI and use it to develop their own applications.

Initially, OpenAI expressed concerns that large language models could be used to generate misleading, biased, or harmful language on a massive scale. By the time it fully released GPT-2, however, it stated that it had seen no strong evidence of misuse so far. For GPT-3 and later models, OpenAI returned to its original cautious position and did not release the models. Many stakeholders are involved in the full release of an AI model. First is the company that built the model, such as OpenAI. Then come those who intend to use it: competing companies, the research community, and the company's paid and unpaid users. Finally, there are the people who are affected "second-hand" by AI products. Although they do not use the AI models directly, they are exposed to the applications and content that AI users create.

❤️

Around the year 2000, NLP progress was focused on shallow parsing, but language models can now generate coherent and comprehensive text and even assist humans in complex tasks. It is difficult to imagine the capabilities of future AI models. In a survey on generative AI (AIGC), the authors argue that generative AI is still at an early stage and that its future development may follow three directions (Zhang et al., 2023). First, AIGC tasks are trending toward more flexible control. For example, while early GAN-based models could generate high-quality images, recent diffusion models enable control through text instructions. In the future, more fine-grained control will be needed for more flexible image generation. Imagine an AI model that can generate an image of anything you want and that fits your expectations perfectly. Second, the authors believe the focus of AIGC models will shift from pretraining to finetuning. In other words, researchers will study downstream and new tasks more than model structure. Third, as the focus shifts from core technology development to applications, the authors expect more startup companies like OpenAI to emerge in response to increasing demand.

These future directions bring more, and more serious, ethical conflicts. The first direction makes more serious misuse of AI accessible. Even with access limited to the GPT-3 API, one study found that extremists could generate synthetic text with only minor modifications. Automation allows ideologically extreme and emotionally provocative content to spread quickly across online platforms, and it becomes much harder for people to distinguish it from human-written content. Such synthetic forum content could be used to recruit new followers and increase the engagement of existing users (McGuffie & Newhouse, 2020). These are the risks of current AI models. More detailed and vivid texts and images generated to attack particular individuals, races, and countries may not remain imaginary in the future. The second direction relates to the open-source discussion. Even API-only access to a pre-trained model lets the public use it in many ways and gives researchers plenty of downstream tasks to explore. That raises the question of whether, viewed through the utilitarian lens, open-sourcing AI models still benefits society and the research community more than it harms them.

Generative AI models have demonstrated remarkable potential in applications such as natural language processing, image generation, video synthesis, 3D content creation, and code development. Their use across industries like education, entertainment, marketing, and creative content production ultimately leads to increased efficiency and innovation. But do these benefits outweigh the ethical and societal challenges these powerful tools pose? Utilitarianism considers the total impact of an action. As long as the company can manage risks while maximizing positive outcomes for the greatest number of people, open-sourced AI models seem more beneficial to society.

The question of whether large AI models should be open-sourced leads to a broader debate: should people develop large AI models at all, and should AI models be this powerful? Timnit Gebru is known for studying algorithmic bias and fairness in machine learning and is a co-founder of Black in AI. She co-authored a paper that discusses the potential risks and ethical concerns surrounding large-scale AI language models, such as data bias, environmental impact, and the concentration of power in the hands of a few tech companies. The paper emphasizes the need for more interdisciplinary research, transparency, and accountability in AI development. On one hand, open-source models allow people to examine potential risks. On the other hand, studying such large language models requires enormous computational power; the carbon footprint of training a large AI model roughly equals 10 cars' lifetime emissions. The paper therefore argues that bigger language models might not always be better and that the AI research community should consider the potential negative impacts of such models on marginalized communities and society as a whole (Bender et al., 2021). Small open-sourced models sound like a good plan, but that does not help much in deciding whether large AI models should be open-sourced.

Interestingly enough, Google fired her right after she tried to publish this paper, which mirrors the paper's own concern about the concentration of power in the hands of a few tech companies. Those companies are the biggest stakeholders in the publication of large AI models, and their evaluation is based not on a utilitarian perspective but on their own advantage. "After hiring researchers like Dr. Gebru, Google has painted itself as a company dedicated to “ethical” A.I. But it is often reluctant to publicly acknowledge flaws in its own systems," the New York Times commented (Metz & Wakabayashi, 2020). We cannot know whether OpenAI's decision not to release its large AI models is based purely on ethical concerns or on company benefit. ChatGPT is surely the most popular AI model, and it is understandable that the company wants to keep the money machine to itself, but it is important that the company pay attention to the ethical issues and social impacts.

In contrast, many of Google's large language models are open-sourced. This is worth examining through the lens of fairness. OpenAI's work builds on previous open-sourced models. If all companies stop publishing their models, how can the AI research community progress? OpenAI researchers took advantage of other people's open-source code but do not give back to the community. Is this fair behavior? On the other hand, powerful AI models depend heavily on pre-training, and withholding model code saves training resources to some degree, which arguably alleviates the resource gap between big companies and individuals. Would that be considered an act of fairness?

The "To Release or Not To Release" dilemma puts us in a tricky situation in which the positive and negative impacts of a release are difficult to weigh against each other. Other ethical lenses can also help in analyzing the overall consequences for society. Consider the right not to be injured: no one wants to be harmed by AI models or their by-products, but do we freely and knowingly choose to risk such injuries? Through the common good lens, sharing something good with the public benefits society as a whole. By the same reasoning, not sharing AI source code that could damage society and create chaos should also be considered to serve the common good.

Without extensive training, funding, or scientific support, the development and deployment of extremist content using open-source large AI models is less threatening. Studies have shown that the provided API is already sufficient to generate any kind of content, and only the companies have enough resources to obtain state-of-the-art models. It is therefore the AI companies' responsibility to monitor API usage and to control data and content. Imitation models trained for harmful use will not be as harmful as fine-tuned generation through the API of the original model. The threats of future unrestricted access to large AI models should be described more comprehensively. Limitations of current AI models, such as AI-generated human figures that lack hands or images with a weird, unreal look, will easily be solved with time. The main focus of the company should be to ensure the responsible and beneficial use of AI-generated content in the future.

References

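Santa Clara University. (n.d.). Open source AI: To release or not to release the GPT-2 synthetic text generator. Retrieved April 29, 2023, from https://www.scu.edu/ethics/focus-areas/technology-ethics/resources/open-source-ai-to-release-or-not-to-release-the-gpt-2-synthetic-text-generator/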

Zhang, C., Zhang, C., Zheng, S., Qiao, Y., Li, C., Zhang, M., Dam, S. K., Thwal, C. M., Tun, Y. L., Huy, L. L., Kim, D., Bae, S.-H., Lee, L.-H., Yang, Y., Shen, H. T., Kweon, I. S., & Hong, C. S. (2023). A complete survey on generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 all you need? arXiv. https://doi.org/10.48550/arXiv.2303.11717

McGuffie, K., & Newhouse, A. (2020). The radicalization risks of GPT-3 and advanced neural language models.

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623. https://doi.org/10.1145/3442188.3445922

Metz, C., & Wakabayashi, D. (2020, December 3). Google researcher says she was fired over paper highlighting bias in A.I. The New York Times. https://www.nytimes.com/2020/12/03/technology/google-researcher-timnit-gebru.html
