Getting Started with ShareGPT 4O

Complete guide to accessing and using the ShareGPT 4O dataset for your research projects

January 27, 2025 · 8 min read · Documentation

Quick Start

ShareGPT 4O is a comprehensive dataset of 92,256 high-quality samples generated by GPT-4o for training multimodal AI models. This guide will help you get started with accessing, downloading, and using the dataset effectively.

Prerequisites

Before working with ShareGPT 4O, ensure you have the following requirements met:

  • Python 3.8 or higher installed
  • Git for cloning repositories
  • At least 50GB of available storage space
  • Stable internet connection for downloading the dataset
  • Basic familiarity with machine learning concepts
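
If you want to confirm the environment before downloading anything, a quick check like the following can help (a minimal sketch; the version and storage thresholds mirror the list above):

import shutil
import sys

# Verify the Python version and free disk space listed in the prerequisites
assert sys.version_info >= (3, 8), "Python 3.8 or higher is required"

free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free disk space: {free_gb:.1f} GB (at least 50 GB recommended)")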

Accessing the Dataset

The ShareGPT 4O dataset is hosted on Hugging Face Hub, making it easily accessible through the datasets library. Follow these steps to get started:

Step 1: Install Required Libraries

pip install datasets torch transformers
pip install huggingface_hub
pip install pillow pandas numpy

Step 2: Load the Dataset

from datasets import load_dataset

# Each subset is a separate dataset configuration
text_to_image = load_dataset("FreedomIntelligence/ShareGPT-4o-Image", "1_text_to_image")
text_image_to_image = load_dataset("FreedomIntelligence/ShareGPT-4o-Image", "2_text_and_image_to_image")

print(f"Text-to-image samples: {len(text_to_image['train'])}")
print(f"Text-and-image-to-image samples: {len(text_image_to_image['train'])}")

Understanding Dataset Structure

The ShareGPT 4O dataset is organized into two main categories, each with specific fields and formats:

Text-to-Image Samples

  • input_prompt: Text description for image generation
  • output_image: Generated image filename
  • output_image_resolution: Image dimensions [width, height]

Text-and-Image-to-Image Samples

  • input_prompt: Text instruction for image modification
  • input_image: Source image filename
  • output_image: Modified image filename
  • output_image_resolution: Output image dimensions
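
To make these fields concrete, here is what individual records might look like (the field names come from the lists above; the values are purely illustrative):

# Illustrative text-to-image record (values are hypothetical)
text_to_image_sample = {
    "input_prompt": "A watercolor painting of a lighthouse at dusk",
    "output_image": "t2i_000001.png",
    "output_image_resolution": [1024, 1024],
}

# Illustrative text-and-image-to-image record (values are hypothetical)
editing_sample = {
    "input_prompt": "Replace the sky with a starry night",
    "input_image": "src_000001.png",
    "output_image": "edit_000001.png",
    "output_image_resolution": [1024, 1024],
}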

Working with the Data

Here's a practical example of how to iterate through the dataset and access individual samples:

from datasets import load_dataset

# Load and explore text-to-image data
dataset = load_dataset("FreedomIntelligence/ShareGPT-4o-Image", "1_text_to_image")
train_data = dataset["train"]

# Display first few samples
for i in range(3):
    sample = train_data[i]
    print(f"Sample {i+1}:")
    print(f"Prompt: {sample['input_prompt'][:100]}...")
    print(f"Image: {sample['output_image']}")
    print(f"Resolution: {sample['output_image_resolution']}")
    print("-" * 50)

Using with Janus-4o Model

The ShareGPT 4O dataset was specifically designed for training the Janus-4o model. Here's how to get started with the pre-trained model:

import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor

# Load the Janus-4o model
model_path = "FreedomIntelligence/Janus-4o-7B"
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
vl_gpt = AutoModelForCausalLM.from_pretrained(
    model_path, 
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
vl_gpt = vl_gpt.to(device).eval()

Best Practices

Data Loading

Use streaming mode for large datasets to avoid memory issues. Set streaming=True when loading the dataset if you don't need to load everything into memory at once.
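
For example, a minimal sketch using the Hugging Face streaming API:

from datasets import load_dataset

# Stream samples instead of downloading the full dataset up front
streamed = load_dataset(
    "FreedomIntelligence/ShareGPT-4o-Image",
    "1_text_to_image",
    split="train",
    streaming=True,
)

# Only a handful of samples are materialized at a time
for sample in streamed.take(3):
    print(sample["input_prompt"][:80])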

Preprocessing

Always validate image paths and prompts before training. Some samples may have missing or corrupted data that should be filtered out during preprocessing.
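
A filtering pass along these lines can drop problematic samples up front (a sketch reusing train_data from earlier; the image_root path is an assumption about how you stored the images on disk):

import os

def is_valid(sample, image_root="images"):
    # Drop samples with an empty prompt or a missing image file
    if not sample.get("input_prompt"):
        return False
    return os.path.isfile(os.path.join(image_root, sample["output_image"]))

clean_data = train_data.filter(is_valid)
print(f"Kept {len(clean_data)} of {len(train_data)} samples")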

Memory Management

When working with high-resolution images, consider resizing them to a standard resolution (e.g., 512x512) to reduce memory usage during training while maintaining quality.
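
A resize helper might look like this (a minimal sketch using Pillow; 512x512 matches the suggestion above):

from PIL import Image

def load_resized(path, size=(512, 512)):
    # Open, normalize to RGB, and downscale with a high-quality filter
    with Image.open(path) as img:
        return img.convert("RGB").resize(size, Image.LANCZOS)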

Common Issues and Solutions

Download Failures

If you experience download timeouts or failures, try these solutions:

  • Use a VPN if accessing from a restricted region
  • Increase timeout settings in your HTTP client (see the sketch below)
  • Download dataset parts individually rather than the full dataset
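
For the timeout and partial-download suggestions, huggingface_hub offers useful knobs (a sketch; the timeout value and file pattern are illustrative, and feature availability depends on your installed version):

import os

# Raise the per-request timeout (seconds); set before huggingface_hub is imported
os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "60"

from huggingface_hub import snapshot_download

# Download only part of the dataset repository rather than everything at once
snapshot_download(
    repo_id="FreedomIntelligence/ShareGPT-4o-Image",
    repo_type="dataset",
    allow_patterns=["*.json"],
)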

Memory Errors

For systems with limited RAM:

  • Use streaming mode when loading datasets
  • Process data in smaller batches (see the sketch below)
  • Consider using a machine with more memory for training
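
For batch processing, wrapping the dataset in a PyTorch DataLoader keeps peak memory bounded (a sketch reusing train_data from earlier; the batch size is illustrative):

from torch.utils.data import DataLoader

# Iterate in small batches instead of materializing everything at once
loader = DataLoader(train_data.with_format("torch"), batch_size=16)

for batch in loader:
    prompts = batch["input_prompt"]  # one small batch of prompts at a time
    # ... process the batch here ...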

Next Steps

Now that you have ShareGPT 4O set up, you can explore both subsets in depth, apply the validation and preprocessing steps above to build a clean training set, and experiment with the Janus-4o model in your own research.