Options for Full Text Search in Core Data

Last weekend Chris Olds and I were discussing text search engines, and in particular how to take advantage of them to speed up searches of free-form text in Core Data. Here’s a summary of what we found. I haven’t tested or implemented any of these ideas. This is simply a summary of what’s out there.

I’m not including techniques that deal with fast searches of short text fields: normalizing your query strings and searchable text, using case-insensitive searches, etc. That’s all well documented by Apple and in the usual Core Data reference books.

I did run across one very cool article outlining a profiling method I hadn’t seen before. The Art & Logic Blog goes one step beyond the typical use of com.apple.CoreData.SQLDebug. Take advantage of the fact that you have SQLite installed on your Mac! You can paste the SQL query logged by your iOS app into SQLite on your Mac, and use the EXPLAIN QUERY PLAN command there to understand the search plan.
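Here is a rough sketch of what that looks like, using Python's built-in sqlite3 module in place of the sqlite3 command line tool. The table and column names (ZNOTE, ZTITLE, ZBODY) are my own assumptions, loosely imitating the schema Core Data generates; in practice you would open a copy of your app's store file and paste in the exact query that SQLDebug logged.

```python
import sqlite3

# Hypothetical table loosely imitating a Core Data generated schema.
# ZNOTE, ZTITLE and ZBODY are assumed names, not taken from a real app.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ZNOTE (Z_PK INTEGER PRIMARY KEY, ZTITLE TEXT, ZBODY TEXT)")
conn.execute("CREATE INDEX ZNOTE_ZTITLE_INDEX ON ZNOTE (ZTITLE)")


def query_plan(sql, params=()):
    """Return the human-readable detail text of each step in the plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The detail text is the last column of every EXPLAIN QUERY PLAN row.
    return [row[-1] for row in rows]


# An equality test on the indexed column can be answered from the index.
indexed_plan = query_plan("SELECT Z_PK FROM ZNOTE WHERE ZTITLE = ?", ("lodge",))

# A LIKE with a leading wildcard on an unindexed column scans the whole table.
scan_plan = query_plan("SELECT Z_PK FROM ZNOTE WHERE ZBODY LIKE ?", ("%lodge%",))

print(indexed_plan)
print(scan_plan)
```

The exact wording of the plan varies between SQLite versions, but the thing to look for is whether a step says SEARCH (an index is being used) or SCAN (every row is visited).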

Full Text Search

Full text search (FTS) is about finding search terms within large bodies of text. This is different from matching someone’s last name to the lastName attribute in a Core Data entity. Imagine instead that your Core Data database contains notes, or newspaper articles, or patent descriptions, or travel resort reviews, and you want to search within the text of those articles. The brute force method is to scan all of the text of each article, searching for matches to the search term. That takes a very long time, and doesn’t always give you the results you want.

Ideally, your FTS within Core Data will respond as quickly as Google or Bing does when you enter a search term. The results will be ranked by relevance. The search will handle word stemming correctly: if I enter a search for “lodge”, I probably want to see results containing “lodges” or “lodging”, too. Core Data does not handle any of these needs.

Roll Your Own

Michael Heyeck wrote an 8 part series of blog articles describing how to build your own FTS capability directly within Core Data, using only Core Data tools and constructs. It’s a very comprehensive series, and it’s a shame it isn’t more widely known. He doesn’t just teach you how to do FTS in Core Data. He also shows you how to read and understand the SQL queries that are generated on your behalf, and how to modify your NSPredicates and data model design to make the queries fast.

The series includes source code for a Notes application with FTS, under BSD license.
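To give a feel for the general idea behind a hand-built index, here is a minimal sketch of an inverted index: a map from each keyword to the set of notes that contain it. This is not Michael Heyeck's code, and it is plain Python rather than Core Data. The KeywordIndex class and tokenize function are hypothetical names of my own. In a Core Data model the same shape would presumably be a keyword entity with a to-many relationship to the notes.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class KeywordIndex:
    """Maps each keyword to the set of note IDs whose text contains it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, note_id, text):
        for word in tokenize(text):
            self.postings[word].add(note_id)

    def search(self, query):
        """Return the IDs of notes that contain every word in the query."""
        words = tokenize(query)
        if not words:
            return set()
        # Copy the first posting set so the intersection never mutates the index.
        result = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            result &= self.postings.get(word, set())
        return result


index = KeywordIndex()
index.add(1, "The lodge by the lake")
index.add(2, "Lake cabins and lodges")

print(index.search("lake"))
print(index.search("Lodge lake"))
```

A lookup touches only the posting sets for the query words instead of scanning every note. Note that this sketch does no stemming, so a search for "lodge" will not find the note that only mentions "lodges".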

Search Kit

When you type something into the Spotlight search bar on your Mac, you’re using FTS. Mac OS X has already built an FTS index of the files on your system, and queries that index. Search Kit is the Core Services framework that Apple uses to deliver those search results, and it’s available to you too. The catch? It’s Mac only, and not integrated into Core Data.

When we were chatting, I mentioned to Chris that Search Kit would make a terrific NSHipster topic. The next day, that’s what happened! The NSHipster article also summarizes the technical issues in Full Text Search nicely.

Indragie Karunaratne has a project on Github that uses Search Kit to back Core Data searches. I’ve only read over the source, and haven’t tried it, but it looks solid. His approach is to build a Search Kit index that returns NSManagedObjectIDs of Core Data objects matching a particular full text search.

Commercial Library

Locayta makes their FTS mobile search engine available to iOS developers: free for non-commercial use, $1000 per commercial app. It’s not integrated with Core Data. An approach similar to the one Indragie Karunaratne took with Search Kit integration would probably work, though.

Hackery

The backing store most commonly used with Core Data, SQLite, includes FTS support. It’s just not exposed in any Core Data API (at least, not as of iOS 6.1).

Wolfert de Kraker describes a technique for using the SQLite FTS4 engine simultaneously with Core Data. It involves creating a Virtual Table within the same SQLite database that Core Data uses. Then he uses FMDB to create a search method which uses the FTS4 search to respond to UISearchDisplayController delegate calls. NSManagedObjectIDs are returned as the raw SQLite search results, and then Core Data retrieves these objects.
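A minimal sketch of the shape of that technique, with several assumptions: I'm using Python's sqlite3 module rather than FMDB, an in-memory database rather than Core Data's own store file, and made-up names (note_search, object_uri, body). The URI strings are placeholders standing in for whatever an NSManagedObjectID's URI representation would be in a real app. It also assumes the SQLite build has FTS4 compiled in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Shadow FTS4 table: one row per Core Data object, holding an identifier
# for the object plus the text to be searched. The porter tokenizer adds
# simple English stemming. All names here are hypothetical.
conn.execute(
    "CREATE VIRTUAL TABLE note_search "
    "USING fts4(object_uri, body, tokenize=porter)"
)

notes = [
    ("x-coredata://STORE/Note/p1", "A quiet lodge near the lake"),
    ("x-coredata://STORE/Note/p2", "Mountain lodges with hot tubs"),
    ("x-coredata://STORE/Note/p3", "City hotel downtown"),
]
conn.executemany(
    "INSERT INTO note_search (object_uri, body) VALUES (?, ?)", notes
)


def matching_object_uris(term):
    """Return the stored object identifiers whose body matches the term."""
    rows = conn.execute(
        "SELECT object_uri FROM note_search WHERE body MATCH ? ORDER BY docid",
        (term,),
    )
    return [row[0] for row in rows]


print(matching_object_uris("lodge"))
```

In the real technique, the app would turn each returned identifier back into a managed object ID and ask Core Data to fetch the object. Keeping the shadow table in sync whenever a note is inserted, edited, or deleted is the part this sketch leaves out.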

This 2010 Stack Overflow answer describes a similar approach. A different answer a few months later makes a sideways variation: instead of storing NSManagedObjectIDs in the shadow SQLite table, store SQLite row IDs as Core Data attributes.

These solutions included a custom copy of SQLite in their projects. Although they are iOS projects, I see no reason you couldn’t use the same approach on OS X.

I found two other blog posts describing other implementations of this approach, one from Regular Rate & Rhythm and one from Long Weekend Mobile, both from 2010.

I have to say that it makes me very nervous to think of mucking around in Core Data’s SQLite file. Call me superstitious.

Open Source FTS

We looked at two long-established open source FTS engines, Xapian and Lucene.

Lucene is a Java-based search engine, part of the Apache project. A port to GNUstep, Lucene Kit, was begun in 2005 and seems to have languished for a while. The most current version I found was https://github.com/zbowling/LuceneKit, which was active as recently as 2012.

Xapian is a C++ search engine, and the one that Chris uses in his production code. It is presently licensed under GPL, which would make for some complications if you were to include it in an iOS project. There was some mention on the Xapian forum of writing an Objective-C binding. The conclusion was that it should be straightforward, but that no one has done it yet.

 

Drobo FS and Lion: update

I just had a phone call from a Drobo senior engineer. He was very frank and direct. It was the sort of conversation two developers have when nobody from management is in the room.

Without going into detail, I have to say that I was impressed. They have of course been testing this setup thoroughly, since the very first Lion developer previews. The Drobo engineer outlined for me the testing procedures they’re using right now, to try to replicate the failures some of us are seeing. They haven’t been able to replicate it. If you can’t make something bleed, it’s hard to kill it.

If you’ve ever shipped software, you’ve faced this situation. A customer experiences some bug, maybe even an intermittent one, that you can’t reproduce yourself. It is maddeningly frustrating for both the developer and the customer.

We were on the phone for 45 minutes. He had very specific logfiles that he wanted from my system. He laid out for me the plan they have for killing this problem, and the multiple approaches seem very sound to me.

They do indeed need the performance tests that first-level support has been asking us to run.

Based on what I learned today I’m going to hang in there for a while longer.

I have created a new Time Machine share on the FS, and I’m running a backup to it now from one of my Lion machines. It’s working fine. I’m going to give it a couple more hours, then kill it, and apply the procedure that Sébastien used. The only change I’ll make is to mount the shares manually using SMB, instead of having to play Beat The Clock.

One other update is that my Snow Leopard machine, which was (immediately after the new firmware) seeing absurdly low throughput, is now functioning fine. I didn’t touch anything. I just let it work.

Drobo FS problems under Mac OS X Lion

I’ve been very disappointed with my Drobo FS during the switch to Lion. I had been using the FS for Time Machine for 3 Macs, and it had been a stable operation for 6 months.

The Lion problems seem to stem from the tighter AFP requirements in Lion. Drobo was clearly not ready. The latest firmware update has made matters worse.

When I upgraded my first machine to Lion a week ago, Time Machine could not connect to the FS. I filed a support ticket with Drobo, and received this astounding response:

We are currently in the process of a firmware update to fix apple’s more stringent AFP requirements. This is just not an issue with drobo. Although we are working on a solutionto fix the issue I do recommend making a complaint to apple if enough customers complain about what they did they may roll it back.

Read that last sentence again. Drobo wanted their customers to ask Apple to change the low-level data communication specs on a major OS upgrade, after the upgrade was released. I was dumbfounded.

Anyone who is in the software business has missed a deadline. I was inclined to cut Drobo some slack for not being ready. But that request for me to lobby Apple on their behalf really left me wondering. Were they simply not able to write a driver that worked with the new spec? Were they planning to drop Mac support? Or were they just behind schedule?

There was one other funny thing happening. When I booted the Lion machine, the Accounts/Login screen appeared, and I could move my cursor around with the mouse. But mouse clicks were ignored. It was impossible to log in! If I shut down the Drobo FS, I was able to log in to the Lion machine. I suspect some background process was hung, waiting for the Drobo. This happened every single time I booted the Lion machine.

On Monday, 4 days after Lion was released, Drobo posted a new version (1.2.0) of the firmware for the FS. Their release notes claimed that it fixed the Time Machine incompatibility. I installed it last night, and things got much worse.

With the new version of the Drobo FS firmware, neither of my machines that are still running Snow Leopard can reliably connect to the Time Machine backup shares. One machine, a Mac Mini that sees very little filesystem activity, takes 20 to 30 minutes to connect to the Drobo. The 5 to 8 MB of data transfer takes about 30 minutes. The other machine, a MacBook Pro, has not yet managed to connect to the FS for a backup. I am seeing frequent Finder freezes (every few minutes) on both Snow Leopard and Lion, and occasional crashes that require a hard reboot.

If you have machines that will remain on Snow Leopard for a while, I suggest that you not install the latest Drobo FS firmware, because it will break Time Machine and make your Finder creep. If you’re in an all-Lion environment, then it doesn’t matter which version of the firmware you use, since neither of them works.

One Twitter follower suggested the Promise DS 4600 as a Drobo replacement. I’m going to give Drobo a few more days before I give up on them completely. I have Backblaze installed on all the machines too, so it’s not a crisis for me yet.

The creation date of the latest FS firmware is July 19, two days before Lion was released, and 6 days before the firmware was released. I have a hunch that things are very busy at Drobo right now.

UPDATE Friday, July 29

I just called Drobo Support. The technician acknowledged that the 1.2.0 firmware does not solve Lion compatibility. He gave me instructions for downloading and running the test suite at http://www.aja.com/ajashare/AJA_System_Test_v601.zip to measure network performance. I haven’t done that yet.

One commenter suggested downgrading the firmware to the previous version. The Drobo support person told me that that would require the datapack to be reset, resulting in total data loss.

My personal advice is still that you skip the 1.2.0 firmware update and wait for their next attempt.