Chapter 20: Pre-trained Models for Text, Vision, and Audio

What you will learn

By the end of this chapter, you will be able to:

Recognize when an existing pre-trained model is the right starting point, rather than building or training your own
Navigate Hugging Face to find, evaluate, and test models directly in your browser before writing any code
Match common text analysis tasks (semantic similarity, zero-shot classification, translation, summarization, and sentiment analysis) to appropriate pre-trained models
Match common computer vision tasks (image classification, object detection, segmentation, and visual question answering) to appropriate pre-trained models
Use CLIP and BioCLIP for zero-shot image classification when labeled training data is not available
Transcribe audio recordings with Whisper and recognize where its output needs human review before research use
Distinguish between analytical models, which extract information from existing data, and generative models, which produce new content, and choose accordingly for your research task

The previous chapters focused on AutoML workflows where you bring your own labeled dataset and AutoGluon handles the modeling. Chapter 19 is a good example: you provide labeled rows, and AutoGluon fine-tunes a pretrained backbone on your prediction task. That approach makes sense when your goal is prediction and you have the labeled data to train on.

But a lot of research work does not look like that. Sometimes you have no labeled data at all. Sometimes your goal is not prediction but understanding: you want to make sense of interview transcripts, organize field photographs, or translate survey responses collected in three different languages. For these kinds of tasks, the most practical first question is not “how do I train a model?” but rather “is there already a model that does this?”

The answer is often yes. The machine learning community has produced a large ecosystem of pre-trained models, many of them freely available through Hugging Face. Hugging Face is a platform that hosts tens of thousands of open-source models, datasets, and interactive browser-based demos. You can browse models by task, read documentation, and in many cases test them directly in your browser before writing a single line of code. Think of it as a model library combined with a sandbox.

This chapter introduces a set of these models across text, vision, and audio. The goal is partly to show you what is out there, and partly to build the habit of checking for a pre-trained solution before committing to building your own. That instinct will save you considerable time.

One distinction worth keeping in mind as you read: the models here fall into two broad categories. Most of them are analytical, meaning they take existing data and extract information from it. A smaller set are generative, meaning they produce new content such as synthesized images or spoken audio. The two types serve different research purposes and come with different considerations around validation and appropriate use. The chapter flags this distinction as it comes up.

Text Analysis with Pre-trained Models

Background: The Transformer Paradigm

Most modern text AI models share a common architectural foundation called the transformer, introduced by Vaswani and colleagues in 2017 [Vaswani et al., 2017]. The core innovation was an attention mechanism: rather than reading text word by word in sequence, the model learns which words are most relevant to each other across an entire sentence or passage. This allowed models to capture long-range dependencies in language far more effectively than earlier approaches.

Building on this foundation, Devlin and colleagues introduced BERT (Bidirectional Encoder Representations from Transformers) in 2019 [Devlin et al., 2019]. What made BERT distinctive was that it reads text in both directions simultaneously, considering the full context around every word rather than only what comes before. BERT was pre-trained on a large amount of text using a masked prediction task, where the model learned to fill in randomly hidden words based on surrounding context. This produced a general-purpose language understanding model that could then be adapted to almost any text task with minimal additional training.

BERT became the blueprint for a generation of specialized models. The models you will encounter in this chapter, including BART for summarization and classification [Lewis et al., 2020], RoBERTa for sentiment analysis [Liu et al., 2019], and embedding models for semantic similarity, are all built on the same transformer foundation that BERT helped establish. Chapter 23 goes deeper into how BERT works and how you can fine-tune it for your own research tasks.

Model Overview

The table below summarizes the six pre-trained text models covered in this chapter, organized by the kind of task they are best suited for.

Model	Task	What it does	Research uses
EmbeddingGemma-300M	Semantic similarity	Converts text into vectors that reflect meaning	Clustering open-ended responses; deduplication; similarity search
BART-Large-MNLI	Zero-shot classification	Assigns user-defined labels without any training data	Thematic coding; filtering interview excerpts; sorting policy documents
Helsinki-NLP / OPUS-MT	Machine translation	Translates between specific language pairs	Multilingual interviews; cross-regional comparative research
BART-Large-CNN	Summarization	Produces concise summaries of long documents	Triaging reports, interviews, and policy texts
BERT Base Uncased	General language understanding	Foundation model for sentence context and masked prediction	Named entity recognition; document classification; basis for fine-tuning
CardiffNLP Twitter RoBERTa	Sentiment analysis	Classifies sentiment in informal text including slang and emoji	Social media monitoring; public perception studies

Using These Models

Semantic similarity and clustering. When you have a large collection of open-ended responses and want to see how they group together conceptually, embedding models are the right starting point. EmbeddingGemma-300M [Schechter Vera et al., 2025] converts each piece of text into a numerical vector that positions similar ideas close together in a high-dimensional space. The model captures meaning beyond surface wording, so responses like “I felt overwhelmed by the workload” and “there was too much to handle” would land near each other even though they share almost no words. From there, you can apply clustering algorithms to identify natural groupings, or calculate how conceptually close any two documents are to each other. This kind of analysis is especially useful in early-stage qualitative work, before you have settled on a coding scheme.

Zero-shot classification. BART-Large-MNLI lets you assign labels to text without needing any labeled training data at all. You simply provide the categories you want, and the model decides which one fits best. A policy researcher might provide labels like “economic concerns,” “housing access,” or “climate policy” and apply them to hundreds of interview excerpts in minutes. The model uses natural-language inference internally, comparing each piece of text against each candidate label to determine the best match. The main thing to watch out for is label phrasing: vague or overlapping labels tend to produce inconsistent results, so it helps to test a few phrasings before running across a full dataset.

Machine translation. The Helsinki-NLP OPUS-MT models are trained on public multilingual corpora and optimized for specific language pairs, with separate models available for hundreds of combinations [Tiedemann and Thottingal, 2020]. For researchers working with multilingual data, these models offer a transparent, open-source alternative to commercial translation APIs. They are particularly well-suited to translating interviews, field notes, or survey responses into a shared analysis language while preserving a clear record of what was translated and how.

Summarization. BART-Large-CNN is an abstractive summarization model, meaning it generates a new condensed version of a document rather than extracting sentences verbatim. This is useful for quickly determining whether a long report or interview is relevant to your research question before reading it in full. Summarization also works well as a preprocessing step: generating summaries first and then applying classification or clustering to the summaries rather than the full documents can significantly speed up analysis of large corpora.

Sentiment analysis. CardiffNLP Twitter RoBERTa is a fine-tuned version of RoBERTa [Liu et al., 2019], trained specifically on tweet data to handle informal writing, slang, emojis, and sarcasm that standard sentiment models tend to misclassify [Barbieri et al., 2020]. For researchers studying public attitudes at scale, whether through social media, online reviews, or open survey responses, this model provides a fast baseline for understanding the emotional tone of text before moving into deeper qualitative analysis.

Computer Vision with Pre-trained Models

Background

Computer vision tasks are not all the same kind of problem, and it helps to understand the distinctions before choosing a model. The four main task types represented in this chapter are image classification (what category does this image belong to?), object detection (where are specific objects located within the image?), image segmentation (which pixels belong to which object?), and visual question answering (what can you infer from the content of this image?). Each type produces a different kind of output and suits different research scenarios. The sections below introduce one representative model for each task type.

Image Classification: Vision Transformer (ViT)

Image classification assigns a single label to an entire image. The Vision Transformer (ViT) applies the same attention mechanism from the text transformer architecture to images by breaking a photograph into fixed-size patches and analyzing them jointly [Dosovitskiy et al., 2021]. This gives ViT strong generalization across diverse image types without requiring specialized preprocessing.

For research purposes, ViT is a good starting point when you have a large collection of images that need to be sorted or labeled quickly and you are not yet sure whether a custom model is necessary. You can test it directly in your browser via the Hugging Face model page. The main constraint is that the model produces a single category label per image and cannot tell you where within the image something appears or how many instances there are.

Example. A field ecologist working with thousands of camera trap photographs can use ViT to quickly separate images into broad categories such as animal present, empty frame, or camera malfunction, before moving into species-level analysis.

Zero-Shot Image Classification: CLIP and BioCLIP

ViT requires labeled training examples to classify images into your categories. CLIP (Contrastive Language-Image Pretraining), developed by OpenAI, takes a different approach [Radford et al., 2021]. Instead of learning fixed categories, CLIP learns a shared embedding space where images and text descriptions are aligned. This means you can classify images using natural language prompts without any retraining. You write out candidate labels as short descriptions, such as “a photograph of dense forest” or “an aerial view of urban development,” and CLIP ranks each image against those descriptions based on similarity.

This zero-shot capability is particularly useful when your categories are still evolving, when labeled training data is hard to collect, or when you are doing exploratory work and want to test several classification schemes before committing to one. A browser demo is available through OpenCLIP on Hugging Face.

For ecological and biodiversity research, BioCLIP offers a domain-specific alternative [Stevens et al., 2024]. It is fine-tuned on a large collection of biological specimen images spanning hundreds of thousands of taxa, which gives it a significant advantage over general CLIP for species-level identification. Researchers working with camera trap images, herbarium specimens, or field photographs can apply BioCLIP to species identification tasks where general-purpose models tend to struggle with fine-grained visual distinctions within a taxon. A demo is available on Hugging Face.

The practical choice between ViT and CLIP comes down to whether you have labeled data. When labeled examples exist and categories are well-defined, a fine-tuned ViT will usually perform better. When you are exploring or when your categories are unusual, CLIP gets you started immediately without any labeling effort.

Example. A researcher studying urban green space could classify satellite image patches into categories such as tree canopy, grass, impervious surface, and water by writing those descriptions as text prompts, with no need to collect labeled training images.

Object Detection: Grounding DINO

Object detection goes further than classification by locating specific objects within an image and drawing bounding boxes around them. Grounding DINO is a zero-shot detection model, meaning you describe what you are looking for in plain language and the model finds it without needing any labeled training examples [Liu et al., 2023]. You might ask it to locate “solar panels” in a satellite image, or “protective equipment” in a set of workplace photographs, and it will return bounding boxes around the matching regions. You can try it at the Hugging Face demo.

The key strength here is flexibility. Because the model accepts free-form natural language descriptions rather than fixed class lists, you can adapt it to unusual object types or domain-specific terminology that a standard detection model might not recognize. The trade-off is speed: Grounding DINO is slower than specialized detectors trained for a narrow task.

Example. Environmental scientists monitoring land use change can use Grounding DINO to locate wind turbines or solar installations across large satellite image archives, without building a custom training dataset from scratch.

Image Segmentation: Segment Anything (SAM)

Segmentation goes beyond bounding boxes to identify which individual pixels belong to a given object, producing precise outlines rather than rectangular regions. The Segment Anything Model (SAM), developed by Meta AI, can segment objects across essentially any image domain without retraining [Kirillov et al., 2023]. You can interact with it through clicks, bounding boxes, or by asking it to automatically propose segments for an entire image. A browser demo is available at segment-anything.com.

The important limitation to keep in mind is that SAM identifies and outlines objects without naming them. It can tell you where things are, but not what they are. In practice this is often used as a first step, with a classification model applied afterward to label each segmented region.

Example. Microbiologists use SAM to automatically outline individual cells in microscopy images, replacing hours of manual tracing and making much larger sample sizes practical.

Visual Question Answering: Qwen3-VL

Visual question answering models can read an image and respond to open-ended questions about its content. Qwen3-VL, the latest generation of Alibaba’s Qwen vision-language series [Wang et al., 2023], takes both an image and a natural-language question as input and generates a descriptive answer. You can ask it things like “What kind of vegetation is in this photograph?” or “Is the person in this image wearing protective equipment?” and receive a text response. A browser demo is available on Hugging Face Spaces.

This kind of model is useful for exploratory analysis of visual materials, rapid documentation of image collections, and generating structured descriptions at scale. The main caveat is that accuracy varies depending on image quality and how familiar the content is to the model, so results should be spot-checked rather than taken at face value.

Example. Researchers working with historical photograph archives can use Qwen3-VL to generate preliminary descriptive metadata for large collections, then review and correct the outputs manually.

Image Captioning: BLIP

The models above either assign predefined categories to images or answer specific questions about them. BLIP (Bootstrapping Language-Image Pre-training) serves a different purpose: it generates free-text descriptions of image content without requiring a prompt [Li et al., 2022]. Given an image, BLIP produces a sentence or two describing what it sees, which makes it practical for generating descriptive metadata at scale, creating alt-text for figures, or producing text that downstream NLP tools can then process and analyze.

For researchers managing large archives of photographs, microscopy images, or field images, this captioning capability offers a way to make visual material searchable and summarizable without writing descriptions by hand. BLIP also supports visual question answering, but where Qwen3-VL handles open-ended and complex reasoning tasks, BLIP is generally more efficient for straightforward captioning workflows where you want a description of image content rather than an answer to a specific question. A browser demo is available at Hugging Face Spaces.

Example. A researcher archiving a large collection of field survey photographs could use BLIP to generate draft descriptions for each image, then review and correct the outputs rather than composing them from scratch.

Vision Model Comparison

Model	Task type	Output	Strengths	Limitations	Research uses
ViT	Classification	Single label per image	Strong baseline; efficient; generalizes well	Requires labeled data for fine-tuning	Sorting large image collections; rapid labeling
CLIP / BioCLIP	Zero-shot classification	Label + similarity score	No labeled data needed; flexible text prompts; BioCLIP specialized for species	Weaker than fine-tuned models on narrow tasks	Exploratory labeling; biodiversity identification
Grounding DINO	Object detection	Bounding boxes	Zero-shot; accepts natural language	Slower than specialized detectors	Mapping objects in satellite images; locating domain-specific items
SAM	Segmentation	Pixel-level masks	Precise outlines; domain-agnostic; no training required	Does not label objects	Cell segmentation; region identification in scientific images
BLIP	Image captioning	Descriptive text	Generates natural descriptions without prompting	Less capable than Qwen3-VL for complex reasoning	Metadata generation; making image archives searchable
Qwen3-VL	Visual Q&A	Text response	Open-ended; flexible; integrates vision and language	Variable accuracy; higher compute needs	Documentation; exploratory analysis; metadata generation

Audio: Transcription and Speech

Many researchers work with audio and never think of it as something a pre-trained model could handle. Recorded interviews, focus group sessions, conference presentations, and oral history archives all contain information that typically requires manual transcription before it can be analyzed. Whisper, developed by OpenAI, changes that considerably [Radford et al., 2023].

Whisper is a speech recognition model trained on a large and diverse corpus of audio from the web, covering many languages and acoustic conditions. It converts spoken audio into text with accuracy strong enough for most research transcription tasks, and it handles background noise, varied accents, and technical vocabulary better than earlier automatic systems. Because the model is open and can be run locally, it is suitable for sensitive research contexts where recordings should not leave your institution’s infrastructure. You can access it through the Hugging Face model page, and Chapter 13 covers the compute options for running it on Great Lakes or a local GPU.

The main limitation worth knowing about is overlapping speakers. Whisper transcribes what it hears but does not automatically separate multiple voices, which means focus group recordings or panel discussions may need post-processing to attribute speech to individual participants. For single-speaker recordings such as individual interviews or lectures, accuracy is generally strong enough to use directly in qualitative analysis, with a round of human review to catch errors in domain-specific terminology.

Beyond transcription, text-to-speech models can convert written text back into natural-sounding audio. This has practical applications in research communication: generating narration for video presentations, producing accessible audio versions of written materials, or creating spoken instructions for experiments. Tools like Bark (open-source) and ElevenLabs (commercial) offer this capability through simple interfaces. For most research workflows this is a secondary application, but the option is there when you need it.

Image Generation: Stable Diffusion

The models covered so far in this chapter are all analytical: they take existing data and extract information from it. Stable Diffusion works differently. It is a generative model that produces new images from text descriptions [Rombach et al., 2022], which puts it in the same broad category as large language models, just for visual content rather than text.

The underlying mechanism is a diffusion process: the model starts from random noise and iteratively refines it, guided by the meaning of your prompt, until a coherent image emerges. Stable Diffusion is open-source and can run on consumer hardware with a modest GPU, which makes it more accessible than many proprietary image generation systems.

For researchers, the most useful applications are not about generating photorealistic images as evidence, but rather about producing visual communication materials faster than traditional illustration workflows would allow. An ecologist explaining habitat fragmentation, a social scientist visualizing a theoretical concept, or an educator building teaching slides can all use image generation to produce conceptual diagrams and placeholder figures without commissioning custom illustrations or spending hours in design software. The model is also used in computer vision research to generate synthetic training data for scenarios that are difficult to photograph in the real world, though results should always be validated against real images before drawing conclusions.

Because Stable Diffusion can run locally, it is appropriate for research settings where visual materials involve sensitive topics or proprietary content that should not be sent to external services. Graphical interfaces like Automatic1111 and ComfyUI make it usable without writing code. A cloud-based option is available through Hugging Face Spaces.

One boundary that matters for research use: generated images should not be presented as photographs or data. They are production tools for communication and experimentation. Journals and conferences increasingly have explicit policies about AI-generated figures, and the right practice is always to disclose when generated imagery appears in a submission. Chapter 7 discusses this in the context of research writing and communication.

Last reviewed: May 2026. Tool-specific content in this chapter refers to the Hugging Face Transformers ecosystem. Model availability and browser interfaces on platforms like Hugging Face Spaces change frequently. If you notice outdated content, open an issue on GitHub.

[1]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. 2017. URL: https://arxiv.org/abs/1706.03762.

[2]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Association for Computational Linguistics, 2019. URL: https://arxiv.org/abs/1810.04805, doi:10.18653/v1/N19-1423.

[3]

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7871–7880. 2020. URL: https://arxiv.org/abs/1910.13461.

[4] (1,2)

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019. URL: https://arxiv.org/abs/1907.11692.

[5]

Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, Daniel Cer, Alice Lisak, Min Choi, Lucas Gonzalez, Omar Sanseviero, Glenn Cameron, Ian Ballantyne, Kat Black, Kaifeng Chen, Weiyi Wang, Zhe Li, Gus Martins, Jinhyuk Lee, Mark Sherwood, Juyeong Ji, Renjie Wu, Jingxiao Zheng, Jyotinder Singh, Abheesht Sharma, Divya Sreepat, Aashi Jain, Adham Elarabawy, AJ Co, Andreas Doumanoglou, Babak Samari, Ben Hora, Brian Potetz, Dahun Kim, Enrique Alfonseca, Fedor Moiseev, Feng Han, Frank Palma Gomez, Gustavo Hernández Ábrego, Hesen Zhang, Hui Hui, Jay Han, Karan Gill, Ke Chen, Koert Chen, Madhuri Shanbhogue, Michael Boratko, Paul Suganthan, Sai Meher Karthik Duddu, Sandeep Mariserla, Setareh Ariafar, Shanfeng Zhang, Shijie Zhang, Simon Baumgartner, Sonam Goenka, Steve Qiu, Tanmaya Dabral, Trevor Walker, Vikram Rao, Waleed Khawaja, Wenlei Zhou, Xiaoqi Ren, Ye Xia, Yichang Chen, Yi-Ting Chen, Zhe Dong, Zhongli Ding, Francesco Visin, Gaël Liu, Jiageng Zhang, Kathleen Kenealy, Michelle Casbon, Ravin Kumar, Thomas Mesnard, Zach Gleicher, Cormac Brick, Olivier Lacombe, Adam Roberts, Yunhsuan Sung, Raphael Hoffmann, Tris Warkentin, Armand Joulin, Tom Duerig, and Mojtaba Seyedhosseini. EmbeddingGemma: powerful and lightweight text representations. arXiv preprint arXiv:2509.20354, 2025. URL: https://arxiv.org/abs/2509.20354.

[6]

Jörg Tiedemann and Santhosh Thottingal. OPUS-MT – building open translation services for the world. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, 479–480. 2020. URL: https://aclanthology.org/2020.eamt-1.61/.

[7]

Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa Anke, and Leonardo Neves. TweetEval: unified benchmark and comparative evaluation for tweet classification. In Findings of the Association for Computational Linguistics: EMNLP 2020, 1644–1650. 2020. URL: https://arxiv.org/abs/2010.12421.

[8]

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations. 2021. URL: https://arxiv.org/abs/2010.11929.

[9]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, 8748–8763. 2021. URL: https://arxiv.org/abs/2103.00020.

[10]

Samuel Stevens, Jiaman Wu, Matthew J. Thompson, Elizabeth G. Campolongo, Chan Hee Song, David E. Carlyn, Li Dong, Wasila M. Dahdul, Charles Stewart, Tanya Berger-Wolf, Wei-Lun Chao, and Yu Su. BioCLIP: a vision foundation model for the tree of life. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19412–19424. 2024. URL: https://arxiv.org/abs/2311.18803.

[11]

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding DINO: marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023. URL: https://arxiv.org/abs/2303.05499.

[12]

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4015–4026. 2023. URL: https://arxiv.org/abs/2304.02643.

[13]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen-VL: a versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023. URL: https://arxiv.org/abs/2308.12966.

[14]

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning, 12888–12900. 2022. URL: https://arxiv.org/abs/2201.12086.

[15]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In Proceedings of the 40th International Conference on Machine Learning, 28492–28518. 2023. URL: https://arxiv.org/abs/2212.04356.

[16]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684–10695. 2022. URL: https://arxiv.org/abs/2112.10752.