Integrating Local LLMs into iOS Apps with MLX Swift
The rise of powerful language models has transformed how we build applications, but running these models typically requires cloud infrastructure and internet connectivity. With Apple’s MLX framework, we can now run sophisticated language models directly on iOS devices, providing privacy, offline capability, and surprisingly good performance. In this guide, I’ll show you how to integrate MLX into your iOS apps to create powerful AI-driven experiences.
Why On-Device LLMs Matter
Before diving into the implementation, let’s understand why you might want to run language models locally on iOS devices:
- Privacy First: User conversations never leave the device
- Offline Capability: Works without internet connection
- No API Costs: No per-token charges or rate limits
- Low Latency: No network round-trips
- Predictable Performance: No dependency on external services
The trade-offs? You’re limited by device capabilities and model size, and MLX requires a physical device: it won’t run in the iOS Simulator because it depends on Metal GPU support. But with Apple Silicon’s GPU and unified memory architecture, modern iPhones can run impressively capable models.
Setting Up MLX Swift
First, add the MLX Swift packages to your Xcode project. You’ll need both the core MLX framework and the LLM examples package:
// In Package.swift or through Xcode's package manager
dependencies: [
    .package(url: "https://github.com/ml-explore/mlx-swift", from: "0.25.4"),
    .package(url: "https://github.com/ml-explore/mlx-swift-examples", from: "2.25.4")
]
Then add these products to your target:
.product(name: "MLX", package: "mlx-swift"),
.product(name: "MLXNN", package: "mlx-swift"),
.product(name: "MLXLinalg", package: "mlx-swift"),
.product(name: "MLXFFT", package: "mlx-swift"),
.product(name: "MLXLLM", package: "mlx-swift-examples"),
.product(name: "MLXLMCommon", package: "mlx-swift-examples")
Creating the MLX Service Layer
Start by creating a service class that wraps the MLX functionality. This provides a clean interface for the rest of your app:
import Foundation
import MLX
import MLXLLM
import MLXLMCommon

@MainActor
class MLXService {
    static let shared = MLXService()

    private var currentModel: LLMModel?
    private let modelCache = NSCache<NSString, LLMModel>()

    struct ModelConfiguration {
        let name: String
        let hubId: String
        let defaultPrompt: String

        static let llama3_2_3B = ModelConfiguration(
            name: "Llama-3.2-3B",
            hubId: "mlx-community/Llama-3.2-3B-Instruct-4bit",
            defaultPrompt: "You are a helpful assistant."
        )
    }
}
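Later snippets also reference an MLXError type that never appears in full. A minimal assumed definition (not something MLX provides) keeps the examples compiling:
enum MLXError: LocalizedError {
    case modelNotLoaded

    var errorDescription: String? {
        switch self {
        case .modelNotLoaded:
            return "No model is loaded. Download and load a model before generating text."
        }
    }
}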
Model Download and Management
One of the most critical aspects is handling model downloads gracefully. These models can be 1-2GB, so you need proper progress tracking and user feedback:
func downloadModel(
    configuration: ModelConfiguration,
    progressHandler: @escaping (Double, String) -> Void
) async throws {
    let hubApi = HubApi()
    let repo = Hub.Repo(id: configuration.hubId)
    let modelFiles = ["model.safetensors", "config.json", "tokenizer.json"]

    var totalSize: Int64 = 0
    var downloadedSize: Int64 = 0

    // Calculate total size first
    for file in modelFiles {
        if let fileInfo = try? await hubApi.fileInfo(repo: repo, path: file) {
            totalSize += fileInfo.size ?? 0
        }
    }

    // Download with progress
    for file in modelFiles {
        let localURL = try await hubApi.download(
            repo: repo,
            path: file,
            progressHandler: { progress in
                let currentProgress = Double(downloadedSize + Int64(progress.completedUnitCount)) / Double(totalSize)
                let message = self.formatProgress(progress: currentProgress, totalSize: totalSize)
                progressHandler(currentProgress, message)
            }
        )

        if let fileSize = try? FileManager.default.attributesOfItem(atPath: localURL.path)[.size] as? Int64 {
            downloadedSize += fileSize
        }
    }
}
private func formatProgress(progress: Double, totalSize: Int64) -> String {
    let downloadedMB = Double(totalSize) * progress / 1_048_576
    let totalMB = Double(totalSize) / 1_048_576

    if progress < 0.3 {
        return String(format: "Downloading model... %.0fMB/%.0fMB", downloadedMB, totalMB)
    } else if progress < 0.7 {
        return String(format: "Processing model files... %.0fMB/%.0fMB", downloadedMB, totalMB)
    } else {
        return String(format: "Finalizing download... %.0fMB/%.0fMB", downloadedMB, totalMB)
    }
}
Loading and Initializing Models
Once downloaded, loading the model requires careful memory management:
func loadModel(
    configuration: ModelConfiguration,
    progressHandler: ((Progress) -> Void)? = nil
) async throws -> LLMModel? {
    // Handle simulator mode gracefully
    if DeviceUtility.isSimulator {
        print("🔮 MLX: Running in simulator - using mock mode")
        // Simulate loading progress
        let progress = Progress(totalUnitCount: 100)
        for i in 0...100 {
            progress.completedUnitCount = Int64(i)
            progressHandler?(progress)
            try await Task.sleep(nanoseconds: 10_000_000) // 10ms
        }
        return nil // No actual model in simulator
    }

    // Check cache first
    if let cachedModel = modelCache.object(forKey: configuration.hubId as NSString) {
        return cachedModel
    }

    // Configure memory limits for iOS
    MLX.GPU.set(cacheLimit: 512 * 1024 * 1024) // 512MB limit

    // Load the model
    let modelContainer = try await LLMModel.load(
        configuration: .init(id: configuration.hubId)
    )

    // Configure generation parameters
    modelContainer.configuration = .init(
        temperature: 0.7,
        topP: 0.9,
        maxTokens: 800
    )

    // Cache for future use
    modelCache.setObject(modelContainer, forKey: configuration.hubId as NSString)
    return modelContainer
}
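Putting the download and load steps together from, say, an onboarding flow might look like the sketch below. It assumes downloadModel and loadModel live as methods on MLXService; the view model name and properties are hypothetical, not part of MLX:
import Foundation
import Combine

@MainActor
final class ModelSetupViewModel: ObservableObject {
    @Published var statusMessage = ""
    @Published var progress: Double = 0
    @Published var isReady = false

    func prepareModel() async {
        do {
            let configuration = MLXService.ModelConfiguration.llama3_2_3B

            // Download the weights, surfacing progress to the UI
            try await MLXService.shared.downloadModel(configuration: configuration) { fraction, message in
                Task { @MainActor in
                    self.progress = fraction
                    self.statusMessage = message
                }
            }

            // Load into memory (returns nil in the simulator's mock mode)
            _ = try await MLXService.shared.loadModel(configuration: configuration)
            isReady = true
        } catch {
            statusMessage = "Setup failed: \(error.localizedDescription)"
        }
    }
}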
Streaming Text Generation
The real magic happens when generating text. MLX provides token-by-token streaming, which creates a natural chat experience:
func generateResponse(
    prompt: String,
    systemPrompt: String? = nil,
    model: LLMModel
) -> AsyncThrowingStream<String, Error> {
    AsyncThrowingStream { continuation in
        Task {
            do {
                // Build the conversation
                var messages: [[String: String]] = []
                if let system = systemPrompt {
                    messages.append(["role": "system", "content": system])
                }
                messages.append(["role": "user", "content": prompt])

                // Apply chat template
                let fullPrompt = try model.applyChatTemplate(messages: messages)

                // Generate tokens
                let startTime = Date()
                var tokenCount = 0
                var outputText = ""

                let stream = try await model.generate(
                    prompt: fullPrompt,
                    maxTokens: 800
                )

                for try await token in stream {
                    tokenCount += 1
                    outputText += token

                    // Filter out model artifacts
                    let cleanToken = token.replacingOccurrences(of: "<think>", with: "")
                        .replacingOccurrences(of: "</think>", with: "")

                    if !cleanToken.isEmpty {
                        continuation.yield(cleanToken)
                    }
                }

                // Log performance stats
                let duration = Date().timeIntervalSince(startTime)
                let tokensPerSecond = Double(tokenCount) / duration
                print("Generated \(tokenCount) tokens in \(duration)s (\(tokensPerSecond) tokens/sec)")

                continuation.finish()
            } catch {
                continuation.finish(throwing: error)
            }
        }
    }
}
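Consuming the stream is just a for-try-await loop. Before wiring it into a full chat screen, a minimal usage sketch (assuming a model already loaded via loadModel) looks like this:
@MainActor
func askOnce(_ question: String, using model: LLMModel) async {
    do {
        var answer = ""
        let stream = MLXService.shared.generateResponse(
            prompt: question,
            systemPrompt: "You are a helpful assistant.",
            model: model
        )

        // Each yielded token arrives as soon as it is generated
        for try await token in stream {
            answer += token
            print(token, terminator: "")
        }
        print("\n--- complete (\(answer.count) characters) ---")
    } catch {
        print("Generation failed: \(error)")
    }
}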
Building the Chat UI
For the user interface, create a chat view with streaming text support:
import UIKit

class ChatViewController: UIViewController {
    private let chatTableView = UITableView()
    private let inputContainerView = UIView()
    private let textField = UITextField()
    private let sendButton = UIButton()

    private var messages: [ChatMessage] = []
    private var currentStreamingText = ""
    private var streamingIndexPath: IndexPath?
    private var loadedModel: LLMModel? // Set once loadModel(configuration:) succeeds

    private func sendMessage() {
        guard let text = textField.text, !text.isEmpty,
              let loadedModel else { return }

        // Add user message
        let userMessage = ChatMessage(
            text: text,
            isUser: true
        )
        messages.append(userMessage)

        // Add empty assistant message for streaming
        let assistantMessage = ChatMessage(
            text: "",
            isUser: false
        )
        messages.append(assistantMessage)

        let assistantIndex = messages.count - 1
        streamingIndexPath = IndexPath(row: assistantIndex, section: 0)
        chatTableView.reloadData()
        scrollToBottom()

        // Start generation
        Task {
            do {
                let stream = MLXService.shared.generateResponse(
                    prompt: text,
                    systemPrompt: "You are a helpful assistant.",
                    model: loadedModel
                )

                for try await token in stream {
                    await MainActor.run {
                        currentStreamingText += token
                        messages[assistantIndex].text = currentStreamingText

                        // Update just the streaming cell
                        if let indexPath = streamingIndexPath {
                            chatTableView.reloadRows(at: [indexPath], with: .none)
                        }
                    }
                }

                streamingIndexPath = nil
                currentStreamingText = ""
            } catch {
                showError(error)
            }
        }

        textField.text = ""
    }
}
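The ChatMessage model (along with helpers like scrollToBottom and showError) isn't shown in the snippets above; a minimal assumed version is enough for the table view, since the streaming code only needs a mutable text field and a flag for who sent the message:
struct ChatMessage {
    var text: String
    let isUser: Bool
}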
Developing with iOS Simulator
One important consideration when working with MLX is that models don’t run in the iOS Simulator. The simulator lacks the Metal GPU support required by MLX, which means you’ll need to handle this gracefully during development. Here’s a pattern I’ve found works well:
// Create a utility to detect simulator environment
struct DeviceUtility {
    static var isSimulator: Bool {
        #if targetEnvironment(simulator)
        return true
        #else
        return false
        #endif
    }
}
// In your MLX service, provide mock responses for simulator
func generateResponse(
    prompt: String,
    systemPrompt: String? = nil
) -> AsyncThrowingStream<String, Error> {
    // Handle simulator mode
    if DeviceUtility.isSimulator {
        return createSimulatorResponse(for: prompt)
    }

    // Normal device flow: the function isn't marked `throws`,
    // so report a missing model through the stream instead
    guard let model = currentModel else {
        return AsyncThrowingStream { continuation in
            continuation.finish(throwing: MLXError.modelNotLoaded)
        }
    }

    // ... rest of implementation
}
private func createSimulatorResponse(
    for prompt: String
) -> AsyncThrowingStream<String, Error> {
    AsyncThrowingStream { continuation in
        Task {
            let mockResponse = """
            🔮 [Simulator Mode] The oracle speaks through the digital void...
            Regarding your question: '\(prompt)'
            The true oracle requires a physical device to channel divine wisdom.
            This is merely a shadow of what could be...
            """

            // Simulate streaming for realistic UI testing
            let words = mockResponse.split(separator: " ")
            for word in words {
                continuation.yield(String(word) + " ")
                try? await Task.sleep(nanoseconds: 50_000_000) // 50ms delay
            }

            continuation.finish()
        }
    }
}
This approach lets you:
- Develop and test your UI in the simulator
- Provide clear feedback about simulator limitations
- Maintain the same streaming interface for consistent behavior
- Avoid runtime crashes from missing Metal support
For the UI, you can show a simulator-specific message:
private func updateDownloadUI() {
    if DeviceUtility.isSimulator {
        titleLabel.text = "Oracle Simulator Mode"
        descriptionLabel.text = "You're running in the iOS Simulator. " +
            "The Oracle will provide mock responses for testing. " +
            "Deploy to a physical device for real AI conversations."
        downloadButton.setTitle("Enable Simulator Oracle", for: .normal)
        iconView.image = UIImage(systemName: "desktopcomputer")
    }
}
Memory Management and Performance
Running LLMs on mobile devices requires careful attention to memory usage:
import Foundation
import MLX

class MLXModelManager {
    private let modelCache = NSCache<NSString, LLMModel>()

    private func checkMemoryPressure() {
        let physicalMemory = ProcessInfo.processInfo.physicalMemory
        let footprint = getMemoryFootprint()

        // Rough heuristic: treat a footprint above half of physical RAM as pressure
        if footprint > Int64(physicalMemory / 2) {
            // Evict cached models and shrink MLX's GPU buffer cache
            modelCache.removeAllObjects()
            MLX.GPU.set(cacheLimit: 256 * 1024 * 1024) // Reduce to 256MB
        }
    }

    // Reports the app's current resident memory footprint via Mach task info
    private func getMemoryFootprint() -> Int64 {
        var info = mach_task_basic_info()
        var count = mach_msg_type_number_t(
            MemoryLayout<mach_task_basic_info>.size / MemoryLayout<natural_t>.size
        )
        let result = withUnsafeMutablePointer(to: &info) {
            $0.withMemoryRebound(to: integer_t.self, capacity: Int(count)) {
                task_info(mach_task_self_,
                          task_flavor_t(MACH_TASK_BASIC_INFO),
                          $0,
                          &count)
            }
        }
        return result == KERN_SUCCESS ? Int64(info.resident_size) : 0
    }
}
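One way to trigger that check is to listen for the system's memory warning notification. The wiring below is an assumption on top of the manager above (it must live in the same file to reach the private helper), and the caller should retain the returned observer token for as long as monitoring is needed:
import UIKit

extension MLXModelManager {
    // Returns the observer token; keep it alive (or remove it) from the call site
    func startMonitoringMemoryWarnings() -> NSObjectProtocol {
        NotificationCenter.default.addObserver(
            forName: UIApplication.didReceiveMemoryWarningNotification,
            object: nil,
            queue: .main
        ) { [weak self] _ in
            // Drop cached models and shrink MLX's buffer cache under pressure
            self?.checkMemoryPressure()
        }
    }
}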
Model Selection Strategy
Not all models are suitable for mobile devices. Here’s what I’ve found works well:
- SmolLM-135M: Ultra-fast, good for simple tasks
- Qwen-1.7B (4-bit): Good balance of quality and speed
- Llama-3.2-3B (4-bit): Best quality that runs smoothly on iPhone
- Larger models (7B+): Possible but slow, better for iPad
Choose based on your use case and target devices. For chat applications, 1-3B parameter models offer the best experience.
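In code, that shortlist can live alongside the ModelConfiguration struct from earlier. The extra hub IDs and the selection heuristic below are illustrative assumptions based on the mlx-community naming scheme; verify the exact repository names on Hugging Face before shipping:
extension MLXService.ModelConfiguration {
    // Hypothetical catalog entries; only llama3_2_3B comes from the earlier snippet
    static let smolLM_135M = MLXService.ModelConfiguration(
        name: "SmolLM-135M",
        hubId: "mlx-community/SmolLM-135M-Instruct-4bit",
        defaultPrompt: "You are a concise assistant."
    )

    static let qwen_1_7B = MLXService.ModelConfiguration(
        name: "Qwen-1.7B",
        hubId: "mlx-community/Qwen3-1.7B-4bit",
        defaultPrompt: "You are a helpful assistant."
    )

    // Rough heuristic: bigger models only on devices with plenty of RAM
    static func recommended(for physicalMemory: UInt64) -> MLXService.ModelConfiguration {
        physicalMemory >= 6 * 1024 * 1024 * 1024 ? .llama3_2_3B : .qwen_1_7B
    }
}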
Pro Tips for Production
- Preload Models: Load models on app launch if previously downloaded
- Background Downloads: Use a background URLSession configuration for large models (see the sketch after this list)
- Smooth Progress: Average download progress over time to avoid jumpy UI
- Simulator Development: Implement mock responses for productive UI development
- Device Testing: Always test on real devices, especially older iPhones with less RAM
- Fallback Options: Provide cloud-based fallback for older devices
- Clear Messaging: When in simulator, clearly communicate limitations to developers
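For the background-download tip, the sketch below shows the general shape of a background URLSession that keeps transferring even if the app is suspended. It downloads a single file by URL; resolving Hugging Face file URLs and moving results into the model cache directory are left out, and the session identifier is a placeholder:
import Foundation

final class BackgroundModelDownloader: NSObject, URLSessionDownloadDelegate {
    private lazy var session: URLSession = {
        let config = URLSessionConfiguration.background(
            withIdentifier: "com.example.model-downloads" // hypothetical identifier
        )
        config.isDiscretionary = false
        config.sessionSendsLaunchEvents = true
        return URLSession(configuration: config, delegate: self, delegateQueue: nil)
    }()

    func download(fileURL: URL) {
        session.downloadTask(with: fileURL).resume()
    }

    // Called even if the app was relaunched in the background to finish the transfer
    func urlSession(
        _ session: URLSession,
        downloadTask: URLSessionDownloadTask,
        didFinishDownloadingTo location: URL
    ) {
        // Move the temporary file somewhere permanent before this method returns
        let destination = FileManager.default.urls(
            for: .applicationSupportDirectory, in: .userDomainMask
        )[0].appendingPathComponent(downloadTask.originalRequest?.url?.lastPathComponent ?? "model.bin")
        try? FileManager.default.moveItem(at: location, to: destination)
    }

    func urlSession(
        _ session: URLSession,
        downloadTask: URLSessionDownloadTask,
        didWriteData bytesWritten: Int64,
        totalBytesWritten: Int64,
        totalBytesExpectedToWrite: Int64
    ) {
        let fraction = Double(totalBytesWritten) / Double(totalBytesExpectedToWrite)
        print("Download progress: \(Int(fraction * 100))%")
    }
}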
Conclusion
Integrating local LLMs into iOS apps with MLX opens up exciting possibilities for privacy-conscious, offline-capable AI features. While there are constraints compared to cloud-based solutions, the framework is remarkably capable and getting better with each release.
The key to success is thoughtful UX design around model downloads, memory management, and setting appropriate user expectations. With these considerations in mind, you can create compelling AI-powered experiences that run entirely on-device.
Whether you’re building a chat assistant, content generator, or any other AI-powered feature, MLX provides the tools to run sophisticated language models directly on iOS devices. The future of mobile AI is local, and MLX makes it accessible to every iOS developer.