Integrating Local LLMs into iOS Apps with MLX Swift


The rise of powerful language models has transformed how we build applications, but running these models typically requires cloud infrastructure and internet connectivity. With Apple’s MLX framework, we can now run sophisticated language models directly on iOS devices, providing privacy, offline capability, and surprisingly good performance. In this guide, I’ll show you how to integrate MLX into your iOS apps to create powerful AI-driven experiences.

Why On-Device LLMs Matter

Before diving into the implementation, let’s understand why you might want to run language models locally on iOS devices:

  1. Privacy First: User conversations never leave the device
  2. Offline Capability: Works without internet connection
  3. No API Costs: No per-token charges or rate limits
  4. Low Latency: No network round-trips
  5. Predictable Performance: No dependency on external services

The trade-offs? You’re limited by device capabilities and model size, and MLX requires a physical device: it doesn’t work in the iOS Simulator because it depends on Metal GPU support. But with Apple Silicon’s GPU and unified memory architecture, modern iPhones can run impressively capable models.
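
For a rough sense of scale, a 3B-parameter model quantized to 4 bits needs about 3 billion × 0.5 bytes ≈ 1.5GB for the weights alone, plus working memory for activations and the KV cache. That’s why 4-bit quantized models in the 1-3B range are the sweet spot on current iPhones.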

Setting Up MLX Swift

First, add the MLX Swift packages to your Xcode project. You’ll need both the core MLX framework and the LLM examples package:

// In Package.swift or through Xcode's package manager
dependencies: [
    .package(url: "https://github.com/ml-explore/mlx-swift", from: "0.25.4"),
    .package(url: "https://github.com/ml-explore/mlx-swift-examples", from: "2.25.4")
]

Then add these products to your target:

.product(name: "MLX", package: "mlx-swift"),
.product(name: "MLXNN", package: "mlx-swift"),
.product(name: "MLXLinalg", package: "mlx-swift"),
.product(name: "MLXFFT", package: "mlx-swift"),
.product(name: "MLXLLM", package: "mlx-swift-examples"),
.product(name: "MLXLMCommon", package: "mlx-swift-examples")

Creating the MLX Service Layer

Start by creating a service class that wraps the MLX functionality. This provides a clean interface for the rest of your app:

import Foundation
import MLX
import MLXLLM
import MLXLMCommon

@MainActor
class MLXService {
    static let shared = MLXService()
    
    private var currentModel: LLMModel?
    private let modelCache = NSCache<NSString, LLMModel>()
    
    struct ModelConfiguration {
        let name: String
        let hubId: String
        let defaultPrompt: String
        
        static let llama3_2_3B = ModelConfiguration(
            name: "Llama-3.2-3B",
            hubId: "mlx-community/Llama-3.2-3B-Instruct-4bit",
            defaultPrompt: "You are a helpful assistant."
        )
    }
}
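
The service also needs an error type for the failure paths used later in this post, such as asking for generation before a model is loaded. Here’s a minimal sketch; only modelNotLoaded is actually referenced later, and the extra case is an assumption you can adapt:

// Minimal error type for the MLX service layer
enum MLXError: LocalizedError {
    case modelNotLoaded
    case downloadFailed(underlying: Error) // illustrative extra case

    var errorDescription: String? {
        switch self {
        case .modelNotLoaded:
            return "No model is loaded. Download and load a model first."
        case .downloadFailed(let underlying):
            return "Model download failed: \(underlying.localizedDescription)"
        }
    }
}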

Model Download and Management

One of the most critical aspects is handling model downloads gracefully. These models can be 1-2GB, so you need proper progress tracking and user feedback:

func downloadModel(
    configuration: ModelConfiguration,
    progressHandler: @escaping (Double, String) -> Void
) async throws {
    let hubApi = HubApi()
    
    let repo = Hub.Repo(id: configuration.hubId)
    let modelFiles = ["model.safetensors", "config.json", "tokenizer.json"]
    
    var totalSize: Int64 = 0
    var downloadedSize: Int64 = 0
    
    // Calculate total size first
    for file in modelFiles {
        if let fileInfo = try? await hubApi.fileInfo(repo: repo, path: file) {
            totalSize += fileInfo.size ?? 0
        }
    }
    
    // Download with progress
    for file in modelFiles {
        let localURL = try await hubApi.download(
            repo: repo,
            path: file,
            progressHandler: { progress in
                let currentProgress = Double(downloadedSize + Int64(progress.completedUnitCount)) / Double(totalSize)
                let message = formatProgress(progress: currentProgress, totalSize: totalSize)
                progressHandler(currentProgress, message)
            }
        )
        
        if let fileSize = try? FileManager.default.attributesOfItem(atPath: localURL.path)[.size] as? Int64 {
            downloadedSize += fileSize
        }
    }
}

private func formatProgress(progress: Double, totalSize: Int64) -> String {
    let downloadedMB = Double(totalSize) * progress / 1_048_576
    let totalMB = Double(totalSize) / 1_048_576
    
    if progress < 0.3 {
        return String(format: "Downloading model... %.0fMB/%.0fMB", downloadedMB, totalMB)
    } else if progress < 0.7 {
        return String(format: "Processing model files... %.0fMB/%.0fMB", downloadedMB, totalMB)
    } else {
        return String(format: "Finalizing download... %.0fMB/%.0fMB", downloadedMB, totalMB)
    }
}
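
Calling this from a download screen is straightforward. Here’s a sketch of the call site, assuming downloadModel lives on MLXService as above; progressView and statusLabel are placeholder UI outlets, not part of the service:

// Hypothetical call site in a download view controller
Task {
    do {
        try await MLXService.shared.downloadModel(
            configuration: .llama3_2_3B
        ) { progress, message in
            // The progress handler isn't MainActor-isolated, so hop back for UI updates
            Task { @MainActor in
                self.progressView.progress = Float(progress)
                self.statusLabel.text = message
            }
        }
        self.statusLabel.text = "Download complete"
    } catch {
        self.statusLabel.text = "Download failed: \(error.localizedDescription)"
    }
}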

Loading and Initializing Models

Once downloaded, loading the model requires careful memory management:

func loadModel(
    configuration: ModelConfiguration,
    progressHandler: ((Progress) -> Void)? = nil
) async throws -> LLMModel? {
    // Handle simulator mode gracefully
    if DeviceUtility.isSimulator {
        print("🔮 MLX: Running in simulator - using mock mode")
        // Simulate loading progress
        let progress = Progress(totalUnitCount: 100)
        for i in 0...100 {
            progress.completedUnitCount = Int64(i)
            progressHandler?(progress)
            try await Task.sleep(nanoseconds: 10_000_000) // 10ms
        }
        return nil // No actual model in simulator
    }
    
    // Check cache first
    if let cachedModel = modelCache.object(forKey: configuration.hubId as NSString) {
        return cachedModel
    }
    
    // Configure memory limits for iOS
    MLX.GPU.set(cacheLimit: 512 * 1024 * 1024) // 512MB limit
    
    // Load the model
    let modelContainer = try await LLMModel.load(
        configuration: .init(id: configuration.hubId)
    )
    
    // Configure generation parameters
    modelContainer.configuration = .init(
        temperature: 0.7,
        topP: 0.9,
        maxTokens: 800
    )
    
    // Cache for future use
    modelCache.setObject(modelContainer, forKey: configuration.hubId as NSString)
    
    return modelContainer
}
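
A typical call site preloads a previously downloaded model shortly after launch so the first chat doesn’t pay the load cost. A sketch, assuming loadModel is exposed on MLXService as shown above:

// Hypothetical preload shortly after launch
Task {
    do {
        if let model = try await MLXService.shared.loadModel(configuration: .llama3_2_3B) {
            print("Model ready: \(model)")
        } else {
            print("No model loaded (simulator mode)")
        }
    } catch {
        print("Model preload failed: \(error)")
    }
}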

Streaming Text Generation

The real magic happens when generating text. MLX provides token-by-token streaming, which creates a natural chat experience:

func generateResponse(
    prompt: String,
    systemPrompt: String? = nil,
    model: LLMModel
) -> AsyncThrowingStream<String, Error> {
    AsyncThrowingStream { continuation in
        Task {
            do {
                // Build the conversation
                var messages: [[String: String]] = []
                
                if let system = systemPrompt {
                    messages.append(["role": "system", "content": system])
                }
                
                messages.append(["role": "user", "content": prompt])
                
                // Apply chat template
                let fullPrompt = try model.applyChatTemplate(messages: messages)
                
                // Generate tokens
                let startTime = Date()
                var tokenCount = 0
                var outputText = ""
                
                let stream = try await model.generate(
                    prompt: fullPrompt,
                    maxTokens: 800
                )
                
                for try await token in stream {
                    tokenCount += 1
                    outputText += token
                    
                    // Filter out model artifacts
                    let cleanToken = token.replacingOccurrences(of: "<think>", with: "")
                                          .replacingOccurrences(of: "</think>", with: "")
                    
                    if !cleanToken.isEmpty {
                        continuation.yield(cleanToken)
                    }
                }
                
                // Log performance stats
                let duration = Date().timeIntervalSince(startTime)
                let tokensPerSecond = Double(tokenCount) / duration
                print("Generated \(tokenCount) tokens in \(duration)s (\(tokensPerSecond) tokens/sec)")
                
                continuation.finish()
                
            } catch {
                continuation.finish(throwing: error)
            }
        }
    }
}

Building the Chat UI

For the user interface, create a chat view with streaming text support:

class ChatViewController: UIViewController {
    private let chatTableView = UITableView()
    private let inputContainerView = UIView()
    private let textField = UITextField()
    private let sendButton = UIButton()
    
    private var messages: [ChatMessage] = []
    private var currentStreamingText = ""
    private var streamingIndexPath: IndexPath?
    
    private func sendMessage() {
        guard let text = textField.text, !text.isEmpty else { return }
        
        // Add user message
        let userMessage = ChatMessage(
            text: text,
            isUser: true
        )
        messages.append(userMessage)
        
        // Add empty assistant message for streaming
        let assistantMessage = ChatMessage(
            text: "",
            isUser: false
        )
        messages.append(assistantMessage)
        
        let assistantIndex = messages.count - 1
        streamingIndexPath = IndexPath(row: assistantIndex, section: 0)
        
        chatTableView.reloadData()
        scrollToBottom()
        
        // Start generation
        Task {
            do {
                let stream = MLXService.shared.generateResponse(
                    prompt: text,
                    systemPrompt: "You are a helpful assistant.",
                    model: loadedModel
                )
                
                for try await token in stream {
                    await MainActor.run {
                        currentStreamingText += token
                        messages[assistantIndex].text = currentStreamingText
                        
                        // Update just the streaming cell
                        if let indexPath = streamingIndexPath {
                            chatTableView.reloadRows(at: [indexPath], with: .none)
                        }
                    }
                }
                
                streamingIndexPath = nil
                currentStreamingText = ""
                
            } catch {
                showError(error)
            }
        }
        
        textField.text = ""
    }
}
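
The ChatMessage type used above only needs mutable text so the streaming cell can be updated in place; a minimal sketch:

struct ChatMessage {
    var text: String   // mutable so streamed tokens can be appended
    let isUser: Bool   // switches between user and assistant cell styles
}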

Developing with iOS Simulator

One important consideration when working with MLX is that models don’t run in the iOS Simulator. The simulator lacks the Metal GPU support required by MLX, which means you’ll need to handle this gracefully during development. Here’s a pattern I’ve found works well:

// Create a utility to detect simulator environment
struct DeviceUtility {
    static var isSimulator: Bool {
        #if targetEnvironment(simulator)
        return true
        #else
        return false
        #endif
    }
}

// In your MLX service, provide mock responses for simulator
func generateResponse(
    prompt: String,
    systemPrompt: String? = nil
) -> AsyncThrowingStream<String, Error> {
    // Handle simulator mode
    if DeviceUtility.isSimulator {
        return createSimulatorResponse(for: prompt)
    }
    
    // Normal device flow
    guard let model = currentModel else {
        // This function doesn't throw, so surface the error through the stream
        return AsyncThrowingStream { continuation in
            continuation.finish(throwing: MLXError.modelNotLoaded)
        }
    }
    
    // ... rest of implementation
}

private func createSimulatorResponse(
    for prompt: String
) -> AsyncThrowingStream<String, Error> {
    AsyncThrowingStream { continuation in
        Task {
            let mockResponse = """
            🔮 [Simulator Mode] The oracle speaks through the digital void...
            
            Regarding your question: '\(prompt)'
            
            The true oracle requires a physical device to channel divine wisdom.
            This is merely a shadow of what could be...
            """
            
            // Simulate streaming for realistic UI testing
            let words = mockResponse.split(separator: " ")
            for word in words {
                continuation.yield(String(word) + " ")
                try? await Task.sleep(nanoseconds: 50_000_000) // 50ms delay
            }
            
            continuation.finish()
        }
    }
}

This approach lets you:

  • Develop and test your UI in the simulator
  • Provide clear feedback about simulator limitations
  • Maintain the same streaming interface for consistent behavior
  • Avoid runtime crashes from missing Metal support

For the UI, you can show a simulator-specific message:

private func updateDownloadUI() {
    if DeviceUtility.isSimulator {
        titleLabel.text = "Oracle Simulator Mode"
        descriptionLabel.text = "You're running in the iOS Simulator. " +
            "The Oracle will provide mock responses for testing. " +
            "Deploy to a physical device for real AI conversations."
        downloadButton.setTitle("Enable Simulator Oracle", for: .normal)
        iconView.image = UIImage(systemName: "desktopcomputer")
    }
}

Memory Management and Performance

Running LLMs on mobile devices requires careful attention to memory usage:

class MLXModelManager {
    private func checkMemoryPressure() {
        let physicalMemory = Int64(ProcessInfo.processInfo.physicalMemory)
        let footprint = getMemoryFootprint()
        let availableMemory = physicalMemory - footprint // rough estimate of headroom
        
        if availableMemory < 500_000_000 { // Less than ~500MB of headroom
            // Evict cached models under memory pressure
            modelCache.removeAllObjects()
            
            // Shrink MLX's GPU buffer cache to free unified memory
            MLX.GPU.set(cacheLimit: 256 * 1024 * 1024) // Reduce to 256MB
        }
    }
    
    // Returns the app's current resident memory footprint in bytes
    private func getMemoryFootprint() -> Int64 {
        var info = mach_task_basic_info()
        var count = mach_msg_type_number_t(MemoryLayout<mach_task_basic_info>.size) / 4
        
        let result = withUnsafeMutablePointer(to: &info) {
            $0.withMemoryRebound(to: integer_t.self, capacity: 1) {
                task_info(mach_task_self_,
                         task_flavor_t(MACH_TASK_BASIC_INFO),
                         $0,
                         &count)
            }
        }
        
        return result == KERN_SUCCESS ? Int64(info.resident_size) : 0
    }
}
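
Rather than only checking before each model load, you can also react to the system’s memory warnings. A small sketch, assuming checkMemoryPressure() stays inside MLXModelManager:

// Inside MLXModelManager: respond to system memory warnings
func startObservingMemoryWarnings() {
    NotificationCenter.default.addObserver(
        forName: UIApplication.didReceiveMemoryWarningNotification,
        object: nil,
        queue: .main
    ) { [weak self] _ in
        self?.checkMemoryPressure()
    }
}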

Model Selection Strategy

Not all models are suitable for mobile devices. Here’s what I’ve found works well:

  1. SmolLM-135M: Ultra-fast, good for simple tasks
  2. Qwen-1.7B (4-bit): Good balance of quality and speed
  3. Llama-3.2-3B (4-bit): Best quality that runs smoothly on iPhone
  4. Larger models (7B+): Possible but slow, better for iPad

Choose based on your use case and target devices. For chat applications, 1-3B parameter models offer the best experience.
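Using the ModelConfiguration struct from earlier, these options can be expressed as presets. The Hugging Face repo IDs below (other than the Llama one already used above) are illustrative and should be verified against what’s currently published in the mlx-community organization:

extension MLXService.ModelConfiguration {
    // Repo IDs other than Llama-3.2-3B are assumptions; verify before shipping
    static let smolLM_135M = MLXService.ModelConfiguration(
        name: "SmolLM-135M",
        hubId: "mlx-community/SmolLM-135M-Instruct-4bit",
        defaultPrompt: "You are a helpful assistant."
    )
    
    static let qwen_1_7B = MLXService.ModelConfiguration(
        name: "Qwen-1.7B",
        hubId: "mlx-community/Qwen3-1.7B-4bit",
        defaultPrompt: "You are a helpful assistant."
    )
}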

Pro Tips for Production

  1. Preload Models: Load models on app launch if previously downloaded
  2. Background Downloads: Use URLSession background configuration for large models
  3. Smooth Progress: Average download progress over time to avoid a jumpy UI (see the sketch after this list)
  4. Simulator Development: Implement mock responses for productive UI development
  5. Device Testing: Always test on real devices, especially older iPhones with less RAM
  6. Fallback Options: Provide cloud-based fallback for older devices
  7. Clear Messaging: When in simulator, clearly communicate limitations to developers
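
For tip 3, a small exponential moving average keeps the progress bar from jumping as individual files start and finish. A minimal sketch:

// Smooths raw progress readings with an exponential moving average
struct SmoothedProgress {
    private(set) var value: Double = 0
    private let smoothing: Double // 0...1, lower = smoother
    
    init(smoothing: Double = 0.2) {
        self.smoothing = smoothing
    }
    
    mutating func update(with rawProgress: Double) -> Double {
        // Blend the new reading with the previous value and never move backwards
        let blended = value + smoothing * (rawProgress - value)
        value = max(value, min(blended, 1.0))
        return value
    }
}

// Usage inside the download progress handler:
// progressView.progress = Float(smoother.update(with: currentProgress))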

Conclusion

Integrating local LLMs into iOS apps with MLX opens up exciting possibilities for privacy-conscious, offline-capable AI features. While there are constraints compared to cloud-based solutions, the framework is remarkably capable and getting better with each release.

The key to success is thoughtful UX design around model downloads, memory management, and setting appropriate user expectations. With these considerations in mind, you can create compelling AI-powered experiences that run entirely on-device.

Whether you’re building a chat assistant, content generator, or any other AI-powered feature, MLX provides the tools to run sophisticated language models directly on iOS devices. The future of mobile AI is local, and MLX makes it accessible to every iOS developer.