Using the Google Gemini API to Analyze Images with Python (Colab or a Local IDE)

This tutorial shows how to use the Google Gemini API in Python to analyze and describe images. You'll learn how to load images from the web or from local files, send them to a Gemini model from Google Colab, and get readable descriptions back, all in a few lines of code.

Before starting, make sure you’re familiar with Google Colab, where we’ll be running our code. You’ll also need an API key from Google AI Studio. Sign in with your Google account and generate a new key if you don’t already have one. To keep your key safe and hidden while working in Colab, we’ll use Colab’s built-in secrets manager.

Download the source code from GitHub, or open it in Google Colab to run it interactively and get hands-on with the Gemini API:
https://github.com/plus2net/Python-basics/blob/main/Gemini_api_2_image_1.ipynb

Step 1: Downloading an Image from a URL

To begin working with image-based prompts using the Gemini API, the first step is to retrieve an image from an online source. Using Python’s requests library, we can fetch image data directly from a URL and store it in memory. This data will later be sent to the Gemini API for analysis or interaction.

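import requests

# URL of the image
image_url = "https://www.go2india.in/upimg/9565.jpg"

# Download the image; stop on an HTTP error instead of keeping an error page
response = requests.get(image_url, timeout=30)
response.raise_for_status()
image_data = response.content  # raw bytes of the image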

Step 2: Converting the Image Data to an Image Object

Once the image is downloaded as binary content, we can use the Image module from Pillow (the maintained fork of PIL, the Python Imaging Library) to turn the byte stream into an image object. This allows further manipulation or display of the image in Python. The final print statement previews the first 20 bytes of the raw data as a quick test.

from PIL import Image
from io import BytesIO

# Wrap the raw bytes in a file-like object so Pillow can read them
image = Image.open(BytesIO(image_data))

# Quick test: preview the first 20 bytes of the binary data
print(image_data[:20])
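
The byte preview above can also be checked in code. The following is a minimal sketch, assuming you only care about JPEG, PNG, and GIF; sniff_image_format is a hypothetical helper name, not part of Pillow or the Gemini SDK, and the sample bytes are made up for illustration.

```python
def sniff_image_format(data: bytes):
    """Return 'jpeg', 'png', 'gif', or None based on the leading bytes."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    return None


# Hypothetical sample bytes, standing in for image_data from Step 1
sample = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"
print(sniff_image_format(sample))        # jpeg
print(sniff_image_format(b"<html>..."))  # None
```

A result of None usually means the download returned something other than an image, such as an HTML error page.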

Step 3: Using Gemini API to Analyze the Image

With the image loaded and prepared, the next step is to send it to the Gemini API for analysis. This example uses google.generativeai to configure the API, authenticate using a secure key from the Colab environment, and pass the image along with a prompt asking the model to describe it. The try-except block ensures that errors are handled gracefully, particularly if the API key is missing or the image object is not available.

import google.generativeai as genai
from google.colab import userdata

try:
  GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
  genai.configure(api_key=GOOGLE_API_KEY)

  # Initialize the model that supports generateContent
  model = genai.GenerativeModel('gemini-2.5-flash')

  prompt = ["Describe the image ", image]
  response = model.generate_content(prompt)

  # Print inside the try block so a failed call never prints a stale 'response'
  print(response.text)

except Exception as e:
  print(f"An error occurred: {e}")
  print("Please check your API key and ensure the 'image' variable is defined.")

Output

This vibrant and bustling image captures a grand religious festival, most likely the Rath Yatra (Chariot Festival) in Puri, India, given the distinctive architecture and the large, decorated chariots.

In the foreground and midground, two towering, elaborately decorated chariots dominate the scene. The central chariot, slightly larger and more prominently featured, is predominantly red with bright yellow vertical stripes and intricate patterns, possibly depicting symbols and deities. It has a multi-tiered, conical canopy-like roof topped with a golden finial. The base of this chariot is adorned with colorful fabrics, garlands, and sculptural elements, and is surrounded by a dense crowd of people. Another similar, though partially obscured, chariot stands to its left, also red and yellow with ornate decorations.

The entire lower half of the image is filled with an immense congregation of people, packed tightly around the chariots and extending into the foreground. Many are dressed in traditional Indian attire, with a mix of colorful and light-colored garments. Some individuals are seen climbing wooden ramps leading up to the chariots, while others are on the chariots themselves. Security personnel in khaki uniforms are visible throughout the crowd, attempting to manage the large gathering.

In the background, the distinctive golden shikhara (spire) of a large temple, characteristic of Kalinga architecture and likely the Jagannath Temple, rises prominently. Its stepped layers are visible, and numerous spectators are perched on its lower roofs and outer walls, observing the festivities from above. To the far left, another smaller, cream-colored temple dome is visible. Various other buildings, some with traditional pitched roofs and others with flat roofs, are scattered throughout the background. One building wall features a Swastika symbol, and another has banners with text in Odia script, one displaying "4G" and a picture of what appears to be PM Modi, providing a contemporary context to the ancient ritual. Green trees are visible on the horizon to the right.

The sky is overcast, suggesting either an early morning or a cloudy day. The overall impression is one of intense spiritual energy, devotion, and a massive cultural celebration.
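
The try-except block in Step 3 catches every error in one place. As a narrower sketch, the helper below guards only the read of response.text. It assumes, as is my understanding of the google.generativeai package, that this accessor raises ValueError when the response holds no usable text (for example when a reply was blocked). safe_text and the two stub classes are hypothetical names used only for this illustration.

```python
def safe_text(response, fallback="(no text returned)"):
    """Return response.text, or a fallback if the accessor raises ValueError."""
    try:
        return response.text
    except ValueError:
        return fallback


# Hypothetical stand-ins for real API responses, for demonstration only
class FakeGoodResponse:
    text = "A crowded festival scene."


class FakeEmptyResponse:
    @property
    def text(self):
        raise ValueError("response contains no valid part")


print(safe_text(FakeGoodResponse()))   # A crowded festival scene.
print(safe_text(FakeEmptyResponse()))  # (no text returned)
```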

Step 4: Display the Image and Get a Response

Before sending the image to the Gemini API, we shrink it in place so that it fits within 512x512 pixels; thumbnail() keeps the aspect ratio, so the result is not necessarily square. We then display the image with IPython.display, which is available in Colab. The API response is rendered as Markdown, which makes the result easier to read directly in a notebook.


import google.generativeai as genai
from google.colab import userdata
from IPython.display import display, Markdown

# Resize and display image
image.thumbnail([512, 512])
display(image)

try:
  GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
  genai.configure(api_key=GOOGLE_API_KEY)

  # Initialize the model
  model = genai.GenerativeModel('gemini-2.5-flash')

  prompt = ["Describe the image ", image]
  response = model.generate_content(prompt)

  # display() is required here: a bare Markdown(...) inside try is not rendered
  display(Markdown(response.text))

except Exception as e:
  print(f"An error occurred: {e}")
  print("Please check your API key and ensure the 'image' variable is defined.")
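
The resize in this step keeps the aspect ratio and never enlarges the image. The arithmetic can be sketched as below. fit_within is a hypothetical helper written for illustration, and Pillow's own rounding may differ by a pixel in edge cases.

```python
def fit_within(width, height, max_width, max_height):
    """Scale (width, height) down to fit the box, keeping the aspect ratio."""
    # A scale of 1.0 at most: images smaller than the box are left unchanged
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(1024, 768, 512, 512))  # (512, 384)
print(fit_within(300, 200, 512, 512))   # (300, 200)
```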

Using an uploaded image to read handwritten text

To work with an image you have uploaded to Colab, replace the image object in the code above with this line:

image = Image.open('hand-written-text.jpg', mode="r")  # create image object

Step 5: Working with a Local Image File

In this step, instead of downloading the image from a URL, we use a locally available image file. Here, a handwritten note image named hand-written-text.jpg is loaded, resized, and displayed. The Gemini API then analyzes the content of the image and returns a descriptive output. This is especially useful for tasks like reading handwritten content or analyzing documents visually.


import google.generativeai as genai
from google.colab import userdata
from IPython.display import display, Markdown
from PIL import Image  # already imported in Step 2; repeated so this cell runs on its own

# Load and resize a local handwritten image
image = Image.open('hand-written-text.jpg', mode="r")
image.thumbnail([512, 512])
display(image)

try:
  GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
  genai.configure(api_key=GOOGLE_API_KEY)

  model = genai.GenerativeModel('gemini-2.5-flash')
  prompt = ["Describe the image ", image]
  response = model.generate_content(prompt)
  # display() is required here: a bare Markdown(...) inside try is not rendered
  display(Markdown(response.text))

except Exception as e:
  print(f"An error occurred: {e}")
  print("Please check your API key and ensure the 'image' variable is defined.")

💻 Local Python Script for VS Code or any other platform.

Keep the API key in a local .env file in the same directory as the script, and read it from there.

🔐 .env File (Same Directory)

Do not add quotes around the key value.
If you're using Git, add .env to your .gitignore to avoid exposing your API key.

GOOGLE_API_KEY=your_actual_api_key_here

The following script reads the API key from the local .env file and configures the Gemini client with it.

import os
from PIL import Image
import google.generativeai as genai
from dotenv import load_dotenv

# Load API key from .env file
load_dotenv()
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

try:
    # Configure Gemini API
    genai.configure(api_key=GOOGLE_API_KEY)

    # Load and resize the image
    image = Image.open("your-image.jpg")  # Replace with the path to your image file
    image.thumbnail([512, 512])
    image.show()

    # Initialize Gemini model
    model = genai.GenerativeModel('gemini-2.5-flash')

    # Send prompt with image
    prompt = ["Describe the image", image]
    response = model.generate_content(prompt)

    # Print the AI response
    print(response.text)

except Exception as e:
    print(f"An error occurred: {e}")
    print("Please check your API key and ensure the image file is valid.")
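
The Colab cells and the local script differ mainly in where the key comes from. As a sketch, a single helper can cover both cases and fail early when the key is missing. load_api_key is a hypothetical name; the helper assumes that google.colab is importable only inside Colab and that, on a local machine, load_dotenv() has already been called.

```python
import os


def load_api_key(name="GOOGLE_API_KEY"):
    """Return the key from Colab secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # importable only inside Colab
    except ImportError:
        key = os.getenv(name)              # local run: value loaded from .env
    else:
        key = userdata.get(name)
    if not key:
        raise RuntimeError(
            f"{name} is not set; add it to Colab secrets or to your .env file"
        )
    return key


# Demonstration with a made-up value; never hard-code a real key like this
os.environ["GOOGLE_API_KEY"] = "demo-key-for-illustration"
print(load_api_key())
```

The returned value can then be passed to genai.configure(api_key=...) exactly as in the scripts above.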

📘 Generate a Travel-Themed Image Description PDF Using Gemini AI

This script uses Google Gemini AI and the ReportLab library to build a PDF from a list of image URLs. Each page holds a resized image and a short AI-generated description, which suits travel journals, photo books, and AI-assisted photo essays.

Gemini AI Integration: The script uses the Gemini model to analyze each image and return a brief description.
PDF Generation: With reportlab.pdfgen.canvas, it dynamically creates one page per image, ensuring both image and description stay within the same page.
Image Handling: Pillow (PIL) is used to fetch, resize, and convert the image before embedding.
Text Wrapping: The description is automatically wrapped and limited to three lines to maintain layout aesthetics.

# Gemini AI prompt with image input
prompt = ["Describe the image in 200 words:", gemini_image]
result = model.generate_content(prompt)

# Prepare and wrap description text
full_text = result.text.strip().replace('\n', ' ')
wrapped_lines = wrap(full_text, width=90)[:3]

# Draw image and description on the canvas
c.drawImage(img_reader, image_x, image_y, width=img.width, height=img.height)
c.setFont("Helvetica", 12)
for line in wrapped_lines:
    c.drawString(50, text_y, line)
    text_y -= 18

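The full script follows. Unlike the excerpt above, it asks for a 100-word description and draws every wrapped line rather than only the first three.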

# Required Libraries
import os
import requests
import io
from PIL import Image
from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas
from reportlab.lib.utils import ImageReader
from textwrap import wrap
import google.generativeai as genai
from dotenv import load_dotenv

# Load Gemini API Key
load_dotenv()
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
genai.configure(api_key=GOOGLE_API_KEY)

# Initialize Gemini Model
model = genai.GenerativeModel("gemini-2.5-flash")  # same model as the earlier steps

# List of Image URLs
image_urls = [
    "https://www.go2india.in/upimg/9565.jpg",
    "https://www.go2india.in/upimg/9561.jpg",
    "https://www.go2india.in/upimg/9559.jpg"
]

# Output PDF Path
pdf_path = "E:\\testing3\\gemini\\travel_book1.pdf"
c = canvas.Canvas(pdf_path, pagesize=A4)
page_width, page_height = A4

for idx, url in enumerate(image_urls, start=1):
    try:
        print(f"Processing {url}")

        # Download and Convert Image (an HTTP error jumps to the except block)
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        img = Image.open(io.BytesIO(response.content)).convert("RGB")

        # Resize Image to Fit PDF
        max_img_width = page_width - 100
        img.thumbnail((max_img_width, 400))
        img_reader = ImageReader(img)

        # Prepare Image for Gemini API
        img_bytes = io.BytesIO()
        img.save(img_bytes, format="JPEG")
        gemini_image = {
            "mime_type": "image/jpeg",
            "data": img_bytes.getvalue()
        }

        # Get AI-generated Description
        prompt = ["Describe the image in 100 words :", gemini_image]
        result = model.generate_content(prompt)

        # Wrap Text to Fit Page Width
        full_text = result.text.strip().replace('\n', ' ')
        wrapped_lines = wrap(full_text, width=90)

        # Draw Image on Page
        image_x = 50
        image_y = page_height - img.height - 100
        c.drawImage(img_reader, image_x, image_y, width=img.width, height=img.height)

        # Draw Description Text Below Image
        text_y = image_y - 30
        c.setFont("Helvetica", 12)
        for line in wrapped_lines:
            c.drawString(50, text_y, line)
            text_y -= 18

        c.showPage()

    except Exception as e:
        print(f"Error processing {url}: {e}")

# Finalize and Save the PDF
c.save()
print(f"\n✅ PDF saved at: {pdf_path}")
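
The script above draws every wrapped line without checking whether the text runs off the bottom of the page. The sketch below shows one way to keep only the lines that fit. lines_that_fit is a hypothetical helper, and the 50-point bottom margin is an assumption; the 18-point line height and the wrap width of 90 are taken from the script.

```python
from textwrap import wrap


def lines_that_fit(text, text_y, line_height=18, bottom_margin=50, width=90):
    """Wrap text and keep only the lines that stay above the bottom margin."""
    lines = wrap(text, width=width)
    # The first line sits at text_y; each following line is line_height lower
    available = max(0, int((text_y - bottom_margin) // line_height) + 1)
    return lines[:available]


long_text = "word " * 200
# Starting at y=104 there is room for lines at 104, 86, 68 and 50
print(len(lines_that_fit(long_text, text_y=104)))  # 4
```

In the loop, passing the result to the drawString calls in place of wrapped_lines would keep each description on its own page.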

Frequently Asked Questions

Q1: How can I extract text from an image using Gemini AI?

Pass a prompt that asks for the text (for example, "Transcribe the text in this image") together with the image, as a PIL object, to the model's `generate_content()` method. This can be done in Google Colab or any local Python environment.

Q2: What image formats are supported?

Common formats like JPG and PNG are supported as long as they are loaded as PIL Image objects.
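
The PDF script above sends the image as a dictionary of raw bytes plus a MIME type instead of a PIL object. A minimal sketch of building that dictionary from a filename with the standard library follows. image_part is a hypothetical helper name, and the byte strings are placeholders rather than real image data.

```python
import mimetypes


def image_part(filename, data):
    """Build a {'mime_type', 'data'} dict, guessing the type from the filename."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"{filename!r} does not look like an image file")
    return {"mime_type": mime_type, "data": data}


part = image_part("scan.png", b"placeholder bytes")
print(part["mime_type"])  # image/png
```

The guess is based on the file extension only, so it does not verify that the bytes really are an image.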

Q3: How can I send multiline prompts?

You can use `"\n"` or triple quotes (`"""`) to send a multiline prompt, or pass it as part of a list along with the image object.
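
As a short illustration, both styles below produce the same prompt string. The wording of the instructions is only an example, and the image variable is a placeholder for the PIL image used elsewhere in this tutorial.

```python
# Style 1: triple quotes keep the line breaks as typed
instructions = """Transcribe the handwritten text in this image.
Keep the original line breaks.
Mark any word you cannot read as [illegible]."""

# Style 2: explicit "\n" between adjacent string literals
same_instructions = (
    "Transcribe the handwritten text in this image.\n"
    "Keep the original line breaks.\n"
    "Mark any word you cannot read as [illegible]."
)

image = object()  # placeholder; in the tutorial this is a PIL image
prompt = [instructions, image]  # text first, then the image

print(instructions == same_instructions)  # True
```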

Q4: What does the temperature parameter do?

The temperature controls the randomness of the model's output. Higher values (e.g., 1.0) make the responses more creative; lower values (e.g., 0.2) make them more focused and deterministic.
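
The effect can be illustrated with the general idea behind temperature sampling: scores are divided by the temperature before being turned into probabilities. This is a sketch of that idea only, not a description of Gemini's internals, and the scores are made up.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 1.0):
    probs = softmax_with_temperature(scores, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.2 almost all of the probability lands on the top-scoring candidate, while at 1.0 the other candidates keep a noticeable share, which is why higher values give more varied output.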

Q5: Can I limit the length of the response?

Yes. Set `max_output_tokens` in the `generation_config` argument of the `generate_content()` call to cap the number of tokens in the response generated by Gemini AI.
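
As a sketch of how both settings can be passed: in the google.generativeai package used in this tutorial, the temperature and the token limit (named max_output_tokens there) travel together in a generation_config argument. describe_briefly is a hypothetical helper, and the default values are arbitrary examples.

```python
def describe_briefly(model, image, temperature=0.2, max_output_tokens=256):
    """Ask the model for a description with a fixed temperature and length cap."""
    generation_config = {
        "temperature": temperature,              # lower = more focused output
        "max_output_tokens": max_output_tokens,  # upper bound on response length
    }
    return model.generate_content(
        ["Describe the image", image],
        generation_config=generation_config,
    )
```

With the model and image objects from the earlier steps, a call would look like response = describe_briefly(model, image), followed by print(response.text).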

Q6: Do I need internet access to use Gemini API?

Yes, Gemini API calls are made over the internet and require a valid API key and active internet connection.

Q7: Can I use this on my local machine instead of Colab?

Yes, you can run the same code locally by securely loading the API key using a `.env` file and installing the required packages using pip.

Conclusion

Using the Gemini API with image inputs opens up powerful possibilities for AI-assisted visual understanding. Whether you're analyzing handwritten notes, product photos, or diagrams, this workflow in Google Colab is efficient and easy to extend. As Gemini continues to evolve, you’ll be able to build even smarter applications by combining text, images, and other media inputs. Stay tuned for more examples and advanced integrations.

Subhendu Mohapatra

Author

Passionate about coding and teaching, I publish practical tutorials on PHP, Python, JavaScript, SQL, and web development. My goal is to make learning simple, engaging, and project‑oriented with real examples and source code.