With the advent of ChatGPT, research on generative artificial intelligence (GAI) has achieved breakthroughs in multimodal information processing, including text, images, and video, and has attracted broad attention. This paper systematically reviews the research progress of GAI and discusses its future development trends. The paper is organized into three parts: the first reviews the development history and research status of GAI in terms of natural language models and image and multimodal models; the second discusses the application prospects of GAI in different fields, focusing on content communication, assisted design, content creation, and personalized customization; the third provides an in-depth analysis of the main challenges facing GAI and summarizes its future development trends.
CHE Lu, ZHANG Zhiqiang, ZHOU Jinjia, LI Lei. The research status and development trends of generative artificial intelligence[J]. Science & Technology Review, 2024, 42(12): 35-43.
DOI: 10.3981/j.issn.1000-7857.2024.01.00029
[1] Bengio Y, Ducharme R, Vincent P, et al. A neural probabilistic language model[J]. Advances in Neural Information Processing Systems, 2000, 13:932-938.
[2] Mikolov T, Karafiát M, Burget L, et al. Recurrent neural network based language model[C]//Proceedings of Interspeech 2010. Makuhari:ISCA, 2010:1045-1048.
[3] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8):1735-1780.
[4] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York:ACM, 2017:6000-6010.
[5] Devlin J, Chang M W, Lee K, et al. BERT:Pre-training of deep bidirectional transformers for language understanding[EB/OL].[2024-03-04]. http://arxiv.org/abs/1810.04805.
[6] Radford A, Narasimhan K, Salimans T, et al. Improving language understanding by generative pre-training[EB/OL]. (2018-06-11)[2024-03-04]. https://openai.com/blog/language-unsupervised.
[7] Radford A, Wu J, Child R, et al. Language models are unsupervised multitask learners[EB/OL]. (2019-02-14)[2024-03-04]. https://openai.com/blog/better-language-models.
[8] Brown T B, Mann B, Ryder N, et al. Language models are few-shot learners[EB/OL].[2024-03-04]. http://arxiv.org/abs/2005.14165.
[9] Ouyang L, Wu J, Xu J, et al. Training language models to follow instructions with human feedback[C]//Proceedings of the 36th International Conference on Neural Information Processing Systems. New York:ACM, 2022:27730-27744.
[10] OpenAI. ChatGPT[EB/OL]. (2022-11-30)[2024-03-04]. https://openai.com/index/chatgpt.
[11] Achiam J, Adler S, Agarwal S, et al. GPT-4 technical report[EB/OL]. (2023-03-15)[2024-03-04]. https://arxiv.org/abs/2303.08774.
[12] Ding N, Qin Y J, Yang G, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models[J]. Nature Machine Intelligence, 2023, 5:220-235.
[13] Wu T, Luo L, Li Y F, et al. Continual learning for large language models:A survey[EB/OL]. (2024-02-07)[2024-03-10]. https://arxiv.org/abs/2402.01364.
[14] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6):84-90.
[15] Kingma D P, Welling M. Auto-encoding variational Bayes[EB/OL].[2024-03-15]. http://arxiv.org/abs/1312.6114.
[16] Goodfellow I J, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets[C]//Proceedings of the 27th International Conference on Neural Information Processing Systems-Volume 2. New York:ACM, 2014:2672-2680.
[17] Karras T, Laine S, Aila T M. A style-based generator architecture for generative adversarial networks[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ:IEEE, 2019:4401-4410.
[18] Karras T, Laine S, Aittala M, et al. Analyzing and improving the image quality of StyleGAN[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ:IEEE, 2020:8110-8119.
[19] Karras T, Aittala M, Laine S, et al. Alias-free generative adversarial networks[EB/OL].[2024-03-15]. http://arxiv.org/abs/2106.12423.
[20] Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words:Transformers for image recognition at scale[EB/OL].[2024-04-11]. http://arxiv.org/abs/2010.11929.
[21] Liu Z, Lin Y T, Cao Y, et al. Swin transformer:Hierarchical vision transformer using shifted windows[EB/OL].[2024-04-13]. http://arxiv.org/abs/2103.14030.
[22] Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models[EB/OL].[2024-05-01]. http://arxiv.org/abs/2006.11239.
[23] Ramesh A, Pavlov M, Goh G, et al. Zero-shot text-to-image generation[EB/OL].[2024-05-01]. http://arxiv.org/abs/2102.12092.
[24] Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision[EB/OL].[2024-05-05]. http://arxiv.org/abs/2103.00020.
[25] Ramesh A, Dhariwal P, Nichol A, et al. Hierarchical text-conditional image generation with CLIP latents[EB/OL].[2024-05-06]. http://arxiv.org/abs/2204.06125.
[26] Rombach R, Blattmann A, Lorenz D, et al. High-resolution image synthesis with latent diffusion models[EB/OL].[2023-12-21]. http://arxiv.org/abs/2112.10752.
[27] Holz D. Midjourney[EB/OL]. (2022-07-12)[2023-12-21]. https://www.midjourney.com/explore.
[28] Esser P, Chiu J, Atighehchian P, et al. Structure and content-guided video synthesis with diffusion models[EB/OL]. (2023-02-06)[2024-05-11]. https://arxiv.org/abs/2302.03011.
[29] Guo D. Pika[EB/OL]. (2023-11-29)[2024-05-11]. https://pika.art.
[30] Zhang S, Wang J, Zhang Y, et al. I2VGen-XL:High-quality image-to-video synthesis via cascaded diffusion models[EB/OL].[2023-11-07]. https://arxiv.org/abs/2311.04145.