Quick Start
ShareGPT-4o-Image is a dataset of 92,256 high-quality samples generated by GPT-4o for training multimodal AI models. This guide walks you through accessing, downloading, and using the dataset effectively.
Prerequisites
Before working with ShareGPT-4o-Image, ensure you have the following requirements met:
- Python 3.8 or higher installed
- Git for cloning repositories
- At least 50GB of available storage space
- Stable internet connection for downloading the dataset
- Basic familiarity with machine learning concepts
Accessing the Dataset
The ShareGPT-4o-Image dataset is hosted on the Hugging Face Hub, making it easily accessible through the datasets library. Follow these steps to get started:
Step 1: Install Required Libraries
pip install datasets torch transformers
pip install huggingface_hub
pip install pillow pandas numpy
Step 2: Load the Dataset
from datasets import load_dataset
# Load the complete dataset
dataset = load_dataset("FreedomIntelligence/ShareGPT-4o-Image")
# Access specific subsets
text_to_image = dataset["1_text_to_image"]
text_image_to_image = dataset["2_text_and_image_to_image"]
print(f"Text-to-image samples: {len(text_to_image)}")
print(f"Text-and-image-to-image samples: {len(text_image_to_image)}")
Understanding Dataset Structure
The ShareGPT-4o-Image dataset is organized into two main categories, each with specific fields and formats:
Text-to-Image Samples
- input_prompt: Text description for image generation
- output_image: Generated image filename
- output_image_resolution: Image dimensions [width, height]
Text-and-Image-to-Image Samples
- input_prompt: Text instruction for image modification
- input_image: Source image filename
- output_image: Modified image filename
- output_image_resolution: Output image dimensions [width, height]
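Put together, a single text-and-image-to-image record looks roughly like this (field names from the schema above; the values are illustrative, not drawn from the dataset):
# Illustrative record; all values are made up
sample = {
    "input_prompt": "Make the sky in this photo look like a sunset.",
    "input_image": "images/input/000123.png",      # hypothetical filename
    "output_image": "images/output/000123.png",    # hypothetical filename
    "output_image_resolution": [1024, 1024],
}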
Working with the Data
Here's a practical example of how to iterate through the dataset and access individual samples:
from datasets import load_dataset
# Load and explore the text-to-image subset
dataset = load_dataset("FreedomIntelligence/ShareGPT-4o-Image")
train_data = dataset["1_text_to_image"]  # subset name as used in Step 2
# Display the first few samples
for i in range(3):
    sample = train_data[i]
    print(f"Sample {i+1}:")
    print(f"Prompt: {sample['input_prompt'][:100]}...")
    print(f"Image: {sample['output_image']}")
    print(f"Resolution: {sample['output_image_resolution']}")
    print("-" * 50)
Using the Janus-4o Model
The ShareGPT-4o-Image dataset was specifically designed for training the Janus-4o model. Here's how to get started with the pre-trained model:
import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor  # janus ships with the Janus model codebase
# Load the Janus-4o model and its processor
model_path = "FreedomIntelligence/Janus-4o-7B"
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
vl_gpt = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
vl_gpt = vl_gpt.to(device).eval()
Best Practices
Data Loading
Use streaming mode for large datasets to avoid memory issues: pass streaming=True to load_dataset when you don't need everything in memory at once, as in the sketch below.
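A minimal sketch, assuming the split name used in Step 2:
from datasets import load_dataset

# Stream samples instead of materializing the whole split in memory
streamed = load_dataset(
    "FreedomIntelligence/ShareGPT-4o-Image",
    split="1_text_to_image",  # split name as used earlier in this guide
    streaming=True,
)

# Records are fetched lazily as you iterate
for sample in streamed.take(5):
    print(sample["input_prompt"][:80])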
Preprocessing
Always validate image paths and prompts before training; some samples may have missing or corrupted data and should be filtered out during preprocessing, as in the sketch below.
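A simple filtering pass might look like the following; the validity checks are illustrative, so adapt them to the failure modes you actually observe:
# Drop records with missing prompts or image references
def is_valid(sample):
    return bool(sample.get("input_prompt")) and bool(sample.get("output_image"))

clean_data = train_data.filter(is_valid)
print(f"Kept {len(clean_data)} of {len(train_data)} samples")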
Memory Management
When working with high-resolution images, consider resizing them to a standard resolution (e.g., 512x512) to reduce memory usage during training while maintaining quality, as in the example below.
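For example, with Pillow (the file paths here are hypothetical; resolve each sample's filename to a local path first):
from PIL import Image

# Downscale to 512x512 before feeding images to the training pipeline
img = Image.open("images/output/000123.png")  # hypothetical local path
img = img.resize((512, 512), Image.LANCZOS)   # LANCZOS preserves detail when downscaling
img.save("images/output_512/000123.png")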
Common Issues and Solutions
Download Failures
If you experience download timeouts or failures, try these solutions:
- Use a VPN if you are accessing the Hub from a restricted region
- Increase the timeout settings in your HTTP client
- Download dataset parts individually rather than the full dataset (see the sketch below)
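Two knobs that often help: huggingface_hub honors the HF_HUB_DOWNLOAD_TIMEOUT environment variable, and snapshot_download can fetch a subset of a repository via allow_patterns. The pattern below is illustrative; check the dataset repository's actual file layout:
import os

# Raise the per-request timeout (seconds); set before importing huggingface_hub
os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "60"

from huggingface_hub import snapshot_download

# Download only part of the dataset repository
snapshot_download(
    repo_id="FreedomIntelligence/ShareGPT-4o-Image",
    repo_type="dataset",
    allow_patterns=["*.json"],  # illustrative; adjust to the repo's file layout
)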
Memory Errors
For systems with limited RAM:
- Use streaming mode when loading datasets (see the streaming sketch above)
- Process data in smaller batches, as shown below
- Consider using a machine with more memory for training
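One simple batching approach over a loaded split is plain index slicing (the batch size is arbitrary here):
batch_size = 32  # tune to your memory budget

# Process the split in fixed-size chunks instead of all at once
for start in range(0, len(train_data), batch_size):
    batch = train_data[start:start + batch_size]  # dict mapping column -> list
    prompts = batch["input_prompt"]
    # ... preprocess or train on this batch ...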
Next Steps
Now that you have ShareGPT-4o-Image set up, here are some recommended next steps:
- Explore the detailed tutorials for specific use cases
- Read the research methodology to understand how the dataset was constructed
- Check the frequently asked questions for additional help