Last weekend Chris Olds and I were discussing text search engines, and in particular how to take advantage of them to speed up searches of free-form text in Core Data. Here’s a summary of what we found. I haven’t tested or implemented any of these ideas. This is simply a summary of what’s out there.
I’m not including techniques that deal with fast searches of short text fields: normalizing your query strings and searchable text, using case-insensitive searches, etc. That’s all well documented by Apple and in the usual Core Data reference books.
I did run across one very cool article outlining a profiling method I hadn’t ever seen before. The Art & Logic Blog goes one step further in the typical use of com.apple.CoreData.SQLDebug. Take advantage of the fact that you have SQLite installed on your Mac! You can paste the SQL query being logged by your iOS app into SQLite on your Mac, and use the EXPLAIN QUERY command there to understand the search plan.
Full Text Search
Full text search (FTS) is about finding search terms within large bodies of text. This is different from matching someone’s last name to the lastName attribute in a Core Data entity. Imagine instead that your Core Data database contains notes, or newspaper articles, or patent descriptions, or travel resort reviews, and you want to search within the text of those articles. The brute force method is to scan all of the text of each article, searching for matches to the search term. That takes a very long time, and doesn’t always give you the results you want.
Ideally, your FTS within Core Data will respond as quickly as Google or Bing does when you enter a search term. The results will be ranked by relevance, The search will handle word stemming correctly: if I enter a search for “lodge”, I probably want to see results containing “lodges” or “lodging”, too. Core Data does not handle any of these need.
Roll Your Own
Michael Heyeck wrote an 8 part series of blog articles describing how to build your own FTS capability directly within Core Data, using only Core Data tools and constructs. It’s a very comprehensive series, and it’s a shame it isn’t more widely known. He doesn’t just teach you how to do FTS in Core Data. He also shows you how to read and understand the SQL queries that are generated on your behalf, and how to modify your NSPredicates and data model design to make the queries fast.
The series includes source code for a Notes application with FTS, under BSD license.
When you type something into the Spotlight search bar on your Mac, you’re using FTS. Mac OS X has already built an FTS index of the files on your system, and queries that index. Search Kit is the Foundation framework that Apple uses to deliver those search results, and it’s available to you too. The catch? It’s Mac only, and not integrated into Core Data.
When we were chatting, I mentioned to Chris that Search Kit would make a terrific NSHipster topic. The next day, that’s what happened! The NSHipster article also summarizes the technical issues in Full Text Search nicely.
Indragie Karunaratne has a project on Github that uses Search Kit to back Core Data searches. I’ve only read over the source, and haven’t tried it, but it looks solid. His approach is to build a Search Kit index that returns NSManagedObjectIDs of Core Data objects matching a particular full text search.
Locayta makes their FTS mobile search engine available to iOS developers: free for non-commercial use, $1000 per commercial app. It’s not integrated with Core Data. An approach similar to the one Indragie Karunaratne took with Search Kit integration would probably work, though.
The backing store most commonly used with Core Data, SQLite, includes FTS support. It’s just not exposed in any Core Data API (at least, not as of iOS 6.1).
Wolfert de Kraker describes a technique for using the SQLite FTS4 engine simultaneously with Core Data. It involves creating a Virtual Table within the same SQLite database that Core Data uses. Then he uses FMDB to create a search method which uses the FTS4 search to respond to UISearchDisplayController delegate calls. NSManagedObjectIDs are returned as the raw SQLite search results, and then Core Data retrieves these objects.
This 2010 Stack Overflow answer describes a similar approach. A different answer a few months later makes a sideways variation: instead of storing NSManagedObjectIDs in the shadow SQLite table, store SQLite row IDs as Core Data attributes.
These solutions included a custom copy of SQLite in their projects. Although they are iOS projects, I see no reason you couldn’t use the same approach on OS X.
I have to say that it makes me very nervous to think of mucking around in Core Data’s SQLite file. Call me superstitious.
Open Source FTS
We looked at two long-established open source FTS engines, Xapian and Lucene.
Lucene is a Java-based search engine, part of the Apache project. A port to Gnustep, Lucene Kit, was begun in 2005 and seems to have languished for a while. The most current version I found was https://github.com/zbowling/LuceneKit, which was active as recently as 2012.
Xapian is a C++ search engine, and the one that Chris uses in his production code. It is presently licensed under GPL, which would make for some complications if you were to include it in an iOS project. There was some mention on the Xapian forum of writing an Objective-C binding. The conclusion was that it should be straightforward, but that no one has done it yet.