A Non-Technical Guide to GenAI

Everything you need to know about GenAI, without the technical stuff.

When I first started researching AI, all the articles were made for ML researchers or engineers who understood “embeddings”, “neural nets”, and “matrix multiplication”.

But I just wanted to know how AI would affect my day-to-day life.

So I spent months dissecting dozens of technical videos and articles, decoding the jargon, and distilling the most relevant parts into this handbook.

It’s made to help people in tech like me - product managers, growth managers, product marketers, customer success, designers and salespeople. The better we understand how AI works, the better we can learn to use AI to help us grow, learn, and adapt to the changing world.

I update this regularly.
I keep it high-level.
I use visuals & memes.
I love feedback.

Let’s start.

AI will replace the people who think it will.

Naval Ravikant

What you’ll learn

This is a non-technical, first-principles handbook for understanding AI and GenAI. This is the first part of a three-part series.

By the end of this first part, you’ll be able to:

  1. Explain AI to your boss, a 5th grader, or your parents

  2. Understand the technical jargon you NEED to know for GenAI

  3. Explain what LLMs are and how they changed the way we interact with AI

  4. Understand why GenAI applications are exploding, the categories that are emerging, and some of their limitations

Chapter I: AI is a prediction machine, just like you

Read these sentences:

Two plus two equals…

Did you naturally finish the sentence with “four”?

How about this:

What pets have four legs?

Did you answer “dog or cat”?

In both cases, your brain predicted what would come next.

And that’s exactly how ChatGPT works - give it a prompt (a task or a question) and it’ll predict what comes next.

Give it a math problem, it’ll predict the solution.
Give it a question, it’ll predict the answer.
Give it a task, it’ll predict what steps need to be taken.

Obviously, ChatGPT doesn’t know Popeyes’ secret recipe. But based on my ask, it predicts that I want a fried chicken recipe and generates one for me.

Our brains work the same way.

We’re always predicting - what the weather will be tomorrow, what to talk about with co-workers at the work party, whether there’s a lion in that bush, how to ask our boss for a raise, how to make small talk with the doorman.

Some of us call this anxiety. But for AI, this power of prediction provides the intelligence needed for the many variable but repetitive tasks we do every day.

Define AI in one sentence

At its core, artificial intelligence takes an input and predicts an output, optimizing for a certain objective.

Input. Output. Objective. It’s that simple?

Virtually every AI works off this fundamental principle.

Understanding this made AI a lot less intimidating for me. For example, when we think of AI, we usually think of robots that look like this.

The real reason why people are trying to stop AI research. Animals.

They’re constantly being fed data about their surroundings, body position, and the floor (inputs), which allows them to predict how to move their joints and limbs (outputs) to stay upright (objective).

Input. Output. Objective.

You’ve probably interacted with AI like this today:

  • Youtube takes your watch history (input), gives you recommended videos (output), optimizing for watch time (objective).

  • Spotify takes your liked songs (input), gives you new playlists (output), optimizing for listening time (objective).

  • ChatGPT takes your questions (input), and gives you answers (output), optimizing for right answers (objective).

  • Weather app takes cloud patterns (input), and gives you tomorrow’s forecast (output), optimizing for accuracy (objective).

  • Google takes your query (input), and gives you relevant webpages (output), optimizing for relevancy & clicks (objective).

  • Tesla takes images of your surroundings (input), and gives you whether it’s a person, traffic light, car or sidewalk (output), optimizing for not crashing into them (objective).
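If you’re curious what that pattern looks like in code, here’s a tiny, made-up sketch in Python. The video titles, topics, and scoring are all invented for illustration - real recommendation systems are vastly more complex:

```python
# A toy illustration of the input -> output -> objective pattern.
# Every name and number here is made up for illustration only.

# Input: topics the user has watched, and minutes spent on each.
watch_history = {"cooking": 120, "basketball": 45, "chess": 10}

# Candidate videos the "AI" could recommend, tagged by topic.
candidates = [
    {"title": "10-minute pasta", "topic": "cooking"},
    {"title": "NBA top plays", "topic": "basketball"},
    {"title": "Opening traps", "topic": "chess"},
]

def predicted_watch_time(video, history):
    """Objective: estimate how long this user would watch this video."""
    return history.get(video["topic"], 0)

# Output: the video predicted to best satisfy the objective (watch time).
recommendation = max(candidates, key=lambda v: predicted_watch_time(v, watch_history))
print(recommendation["title"])  # -> "10-minute pasta"
```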

The power of AI is prediction.

🤔 Companies are made of people who make decisions, based on predictions. The better the prediction, the better the decisions.

AIs are prediction engines.

Where could better predictions help you in your job? Understanding this will allow you to see where AI fits into your company’s strategy.

Netflix takes my watch history (input), predicts other recommended shows (output), to optimize for monthly watch time (objective).

How do our brains get better at predicting?

We’re exposed to enormous amounts of scenarios, examples, and stories from family, friends, and the internet as we grow up. We form our own conclusions. We do trial and error. We go to college. And each piece of information helps us get better at predicting and making decisions.

AI models learn in the same ways.

AI models need to be fed vast amounts of data and examples related to their intended task. And like humans, there are many ways AI models are “trained”, each method helping the model get better at predicting accurately.

Modern-day AI was inspired by how human brains work. “Neural nets”, the core breakthrough underpinning AI today, borrow from how our neurons learn & fire. As a result, the ways we train AI are similar to how we train humans (seen above). These terms aren’t needed for this chapter, but will be covered in the future.

Previously, these AIs were only available to the largest corporations because you needed to know code to interact with them. And only companies with the best ML researchers & data scientists could code them.

But the arrival of Large Language Models (LLMs) has made it considerably easier for anyone to interact with & instruct AIs in plain English. No code needed.

LLMs sparked the wave of GenAI tools we see today.

Chapter II: How Large Language Models democratized access to AIs

The development of AI is as fundamental as the creation of the microprocessor, the personal computer, the Internet, and the mobile phone

Bill Gates

If you want to understand GenAI, you need to know what LLMs are.

Large Language Models (LLMs) are just that - AI models trained on trillions of words of text from Wikipedia, GitHub, and other sites.

Just like how our brains learn language by example from family, friends, teachers, textbooks, the internet, and movies, LLMs learn by ingesting massive amounts of examples.

In doing so, LLMs learn relationships between words - like how “how” is often followed by “to [task]”, how “nugget” usually comes after “chicken” or “gold”, how “right” can mean “correct” or “the opposite of left” depending on the context, and even how code works. When we talk, we predict and say the next word in the sentence. LLMs do the same: they predict the next word, the next sentence, and the next paragraph to form cohesive thoughts, and then generate them.

And with enough data, they become incredible at predicting & generating text.

LLMs are basically really advanced versions of Google autocomplete - they try to predict the next few words to complete your query.
Most of the time they’re right. But other times…

🤔 How do LLMs predict words?
LLMs predict the next word based on probability. Given the context (the words so far), the model ranks possible next words by how likely they are to appear, then usually selects the highest-probability word.

In practice, researchers use sampling methods (picking from a set of high-probability words/phrases) to add “creativity” to answers. I’ll touch on this in later articles.

A simplistic example of how an LLM predicts words.
It selects the highest-probability word, given the last word, sentence, or paragraph.
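Here’s a tiny Python sketch of that idea. The words and probabilities are invented for illustration - a real LLM scores tens of thousands of possible tokens with a neural network, not four hand-picked words:

```python
import random

# Invented probabilities for the word that follows "chicken ...".
# A real LLM computes these over its entire vocabulary.
next_word_probs = {
    "nugget": 0.55,
    "soup": 0.25,
    "wing": 0.15,
    "telescope": 0.05,
}

# Greedy decoding: always pick the single most likely next word.
greedy = max(next_word_probs, key=next_word_probs.get)
print(greedy)  # -> "nugget"

# Sampling: pick in proportion to probability, which is roughly
# how "creativity" gets added to answers.
words = list(next_word_probs)
weights = list(next_word_probs.values())
sampled = random.choices(words, weights=weights, k=1)[0]
print(sampled)  # usually "nugget", sometimes "soup" or "wing"
```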

Researchers started to stress-test LLMs on a range of tasks and were surprised by what they were able to do:

Moments after GPT-4 went public.

🤯 AIs don’t understand, they predict
LLMs don’t understand the answer - they just “predict” the answer to questions. It’s an important distinction. As a result, they’re prone to inaccurate answers or hallucinations, where the AI makes up facts.

This is an example of Google Bard’s LLM incorrectly naming months because it picked up on a pattern rather than giving the factual answer.

My take: Humans hallucinate and make up facts all the time (looking at you, politicians 👀). And as LLMs get better, hallucinations will happen less, and eventually less than humans. It’s happened with self-driving cars, where Tesla Autopilot is 9x safer than human drivers.

Interacting with AI using text, not code.

Before LLMs, there were traditional Natural Language Processing (NLP) models. Traditional NLP models helped us with a specific language task like sentiment analysis (analyzing text, like a customer review, to see whether it’s positive or negative). But you needed to know how to code logic, rules, & statistics to program the models.

Whereas with LLMs, you just need to use English to direct the model to do the same task.

Example of using ChatGPT for sentiment analysis using text.
I could also upload a CSV of reviews and ask ChatGPT to rate each review out of 10. No code needed.
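If you wanted to do the same thing programmatically, it might look something like the sketch below. call_llm is a made-up placeholder for whatever LLM API you actually use, and the prompt wording is just one way to phrase the instruction:

```python
# A sketch of LLM-based sentiment analysis.
# call_llm() is a placeholder, not a real library function - swap in a
# call to whichever LLM provider you use.

def call_llm(prompt: str) -> str:
    # Returning a canned answer so this sketch runs on its own.
    return "Positive"

reviews = [
    "The delivery was fast and the food was amazing!",
    "Cold fries, rude staff, never again.",
]

for review in reviews:
    prompt = (
        "Rate the sentiment of this customer review as Positive, "
        f"Negative, or Neutral, and reply with one word only:\n\n{review}"
    )
    # The model "predicts" the label the same way it predicts any next word.
    label = call_llm(prompt)
    print(review, "->", label)
```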

LLMs made it easier for anyone to use AI because:

  1. No code needed: You no longer needed to know code to interact & instruct AIs.

  2. Multi-use models: You only needed to interact with one model, instead of creating multiple.

LLMs did to AI, what smartphones did to software.

Smartphones democratized software - with touch.
LLMs democratized AI - with text.

And when technology is democratized, creativity is too.

Chapter III: The GenAI revolution

Before the 2020s, most AIs were great at predicting but they still needed to reference existing data.

Think Netflix’s recommendation model. Netflix can predict the next show you’re likely to watch, but it needs to pull the title, thumbnail, and description that’s already stored in a database.

But what if Netflix’s recommendation model could predict the next show and generate the ideal thumbnail and description, personalized to each Netflix user’s tastes and what they’re most likely to press play on?

Researchers wondered, “could AIs be creative and generate text or images that haven’t been made before?”

They started developing AI models that generated content in the form of text, and then later images, videos, and audio. All of which are more useful to a human than a prediction alone.

“Generative models allow machines to learn from data and then create new, original content, and they have the potential to revolutionize industries from music to fashion to gaming.”

Yann LeCun, Director of AI Research at Facebook

It’s just like the 1970s, when computers were owned by hobbyists - technologically fascinating, but not very useful. Once personal computers made it into the hands of consumers who could start making spreadsheets and word docs, they became useful.

By being able to predict and generate content, GenAI created outputs that were not just new, but useful to us.

This new wave of GenAI applications lets us generate in creative mediums beyond text, such as image, video, and audio. We call these modalities - a fancy term for the different types of data.

GenAI, or “Generative AI”, refers to applications that use AI models to output text, images, video, or audio.

Many types of GenAI models follow a similar format: they take input in one modality, and output in another modality.

The most popular ones are:

  1. Text-to-text (eg. ChatGPT)

  2. Text-to-image (eg. Midjourney)

  3. Text-to-video (eg. RunwayML)

  4. Text-to-audio (eg. Speechify)

  5. Audio-to-text (eg. Descript)

Each type of model is trained on different data, using different AI architectures and specialized researchers, resulting in different levels of accuracy and reliability. As a result, each type of model is progressing at a different speed.

Below is a high-level snapshot of the capabilities of each type of model today:

As of March 2025.
Expert-level: Performs the majority of tasks on par with experts in the field
Human-level: Performs many tasks on par with humans in the field
Task-level: Performs one task on par with humans in the field

Text-to-text models like GPT-4 and Perplexity are LLM-centric GenAI apps whose accuracy and reliability are consistent enough for professional use. I use them for:

  • Teaching myself technical concepts using analogies

  • Debugging & writing code in SQL or HTML

  • Creating new recipes with my personal tastes

  • Research & pulling direct citations/quotes

  • Summarizing my legal docs

  • Translating marketing emails/copy

  • Brainstorming different marketing emails/copy

As researchers develop better architectures and train models on better data, we can expect AI to get better.

Image, audio, and video models are still struggling with their own problems.

Image models struggle with consistency for branding.
Audio models struggle with natural-sounding speech and, more recently, pacing.
Video models struggle with speed of generation & a UX to tweak/edit clips.

I’ll often use a suite of GenAI applications.

For general tasks, learning, and step-by-step instructions, I’ll use GPT-4.
For research & fact-checking, I’ll use Perplexity.
For stock photo generation, I’ll use Midjourney.
For editing clips/shorts from a long-form video, I’ll use Opus Pro.

🐇 Rabbit holes ahead:
If you’re curious about how each type of model works, here are the key concepts you should know (they are too technical for this article):

👉 LLMs have benefitted from transformer models & attention research.
👉 Image models have benefitted from U-Net and diffusion models.
👉 Video models have benefitted from space-time U-Net models.

Trending towards multi-modal models & prompts.

However, real life isn’t always in well-formatted text.

I’m a messy notetaker. I take notes in Notion and Apple Notes. I scribble notes on a whiteboard and take pictures of it. I record audio notes when I’m out.

My notes live in digital text, pictures, and audio recordings. Reading my notes is a headache.

But I can actually upload them all to ChatGPT. ChatGPT reads the digital notes directly, uses computer vision to “read” the pictures, and uses an audio-to-text model to “listen” to my audio notes and convert them to text.

I could ask it to consolidate all my notes into one text file.
I could ask it to summarize all my notes on a certain topic.
I could ask it to pull up all the relevant notes for a specific topic I was working on, regardless of whether they’re in Notion, in a picture, or in an audio recording.

With context across all my scattered notes, my notes became 10x more valuable using ChatGPT, and it only took a few seconds.

How ChatGPT processes my inputs:
1. I create my text prompt & upload all relevant documents.
2. ChatGPT uses a combination of models to process each document into text.
3. A final text-to-text model combines all four inputs into one prompt and generates an output.
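Conceptually, that three-step pipeline looks something like the sketch below. The helper functions (read_text, describe_image, transcribe_audio) are hypothetical stand-ins for the models ChatGPT uses under the hood, and the file names are made up:

```python
from pathlib import Path

# Hypothetical stand-ins for the models that turn each modality into text.
def read_text(path: Path) -> str:
    return f"[contents of {path.name}]"     # in real life: path.read_text()

def describe_image(path: Path) -> str:
    return f"[description of {path.name}]"  # a vision model would go here

def transcribe_audio(path: Path) -> str:
    return f"[transcript of {path.name}]"   # an audio-to-text model would go here

# Step 1: gather the scattered notes, whatever format they're in.
notes = [Path("notion_export.txt"), Path("whiteboard.jpg"), Path("voice_memo.m4a")]

# Step 2: route each file to the right model so everything becomes text.
converters = {".txt": read_text, ".jpg": describe_image, ".m4a": transcribe_audio}
as_text = [converters[note.suffix](note) for note in notes]

# Step 3: combine everything into one prompt for a text-to-text model.
prompt = "Summarize these notes:\n\n" + "\n\n".join(as_text)
print(prompt)
```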

✨ How much text, documents, and images can I upload to a model?

Context window is the short-term memory of a model.

The larger the context window, the larger the prompts & the more documents (and videos) you can upload for the model to recall from.

However, larger context windows usually come with higher compute costs and slower response times, which creators of foundation models like Google are still figuring out.
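A rough way to picture the context window in code: the 4-characters-per-token rule of thumb and the window size below are ballpark illustrations, not exact figures for any specific model:

```python
# Rough sketch of a context-window check.
CONTEXT_WINDOW_TOKENS = 128_000  # example size only; varies by model

def estimate_tokens(text: str) -> int:
    # Very rough rule of thumb: ~4 characters per token for English text.
    return len(text) // 4

documents = ["[imagine a long PDF pasted here]", "[imagine meeting notes here]"]
prompt = "Summarize all of my notes."

total = estimate_tokens(prompt) + sum(estimate_tokens(doc) for doc in documents)

if total > CONTEXT_WINDOW_TOKENS:
    print("Too much text - the model can't 'remember' all of it at once.")
else:
    print(f"Fits: roughly {total} of {CONTEXT_WINDOW_TOKENS} tokens used.")
```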

A multi-modal model like this allows for multiple types of data as prompts, from images, videos, audio, and text.

Another example is Midjourney. Midjourney is a GenAI application that allows users to create images using text prompts. They’ve grown to over $200M in revenue with only 40 employees - in under 2 years.

Midjourney allows users to input image prompts & text prompts.

How Midjourney allows users to input image prompts, text prompts, and parameters (Midjourney-specific syntax for editing images)
Source: Midjourney Docs

You can input an image, and use text to change the style and objects in it.

You can also input 2 images, and combine them into a new image.

Okay, so why does multi-modality matter?

Multi-modality allows AI to see and hear the world the way we do.

And most importantly, it allows AI to interact with our world. But our world is messy.

Code is great at automating tasks, but it needs structured and consistently formatted data. The real world is anything but that - it’s variable, always changing, and uncertain.

Our world is made up of trillions of websites, buttons, charts, graphs, meetings, research papers, PDFs, legal documents, Youtube videos, news articles, and podcasts that come in all different shapes and sizes.

However, multi-modal AIs are incredible at understanding and interacting with variable, new situations and data. In fact, “AI agents” are early attempts at creating AI that can navigate these modalities.

  • [Easy] Imagine sending an image of your broken bike to AI and have it send back what you need to do to fix it with relevant how-to images.

  • [Medium] Imagine if you could scan a 3-hour-long basketball game, find every moment your favourite players appear, and edit those moments into a 7-min highlight reel.

  • [Hard] Imagine if you could scan the top travel sites, find combinations of flights and hotels that are under $500, send you the top 5 trips you could take, and book them.

Multi-modal models allow us to use AI in the real world full of uncertainty and variability. It’s why Sam Altman, CEO of OpenAI, is focusing on multimodality.

"Multimodality will definitely be important. Speech in, speech out, images, eventually video."

Sam Altman, CEO of OpenAI

Next: Navigating GenAI product strategies

We’ve learnt that AI is a prediction machine.

LLMs are text prediction machines and have been essential for making GenAI models accessible to non-technical people.

As a result, there has been a boom in Generative AI - applications that allow users to generate text, images, videos, and audio content in creative ways.

But the most valuable GenAI models incorporate multi-modality, the ability to understand and process different types of data, which allows us to use AI in the real world.

In the next part, Understanding GenAI product strategies, we’ll learn:

  1. What makes a defensible GenAI application

  2. The archetypes of GenAI startups

  3. How GenAI startups are making millions and where value lives

  4. Practical examples of GenAI applications and how you can use them in life & work

Got a question? Feedback? I love both.
👇 Drop a comment below.
