Testing AI Help with Band Director Information

For years now, I’ve been using DEVONThink to keep track of lots of information: PDFs I want to read (and probably never will), resources I save like a packrat in case I might one day send them to a friend or colleague (this does happen on rare occasion), and documents that need regular use across devices.

Once upon a time, Spotlight on my iPhone was good at finding things I put in my Dropbox. Then, around 2019, it ceased to be useful at all, and referencing documents I’d saved on the go started requiring me to trudge through Dropbox’s app, which I am not fond of. After some time trying different solutions to make this more manageable, I eventually jumped into DEVONThink. While it has its own Spotlight feature, DEVONThink is also not great at being accessible that way, but it is a good app in spite of that. Beyond syncing things, it lets me get a URL directly to a folder[1] or individual file, and I can then easily put those links in the notes of different events to manage all my information.

I have been satisfied with DEVONThink since picking it up, but DEVONThink 4 came out earlier this year, and its marquee feature is its new ability to work with LLMs. I picked it up on sale, but I haven’t been doing a whole lot with AI in general; to date I hadn’t done anything that required getting my own API key (or paying for any AI at all). Tonight, though, I was motivated to kick the tires on it.

I mostly wanted to test it on things I might need to know as a director. I keep a lot of information for honor bands in DEVONThink. Of note, I don’t keep copies of the emails I send in DEVONThink, though if I see myself using these AI features more in the future, I might keep them.[2]

I wanted to ask some things I’ve found myself having to look up more than once (things that should just get a note somewhere else in my system to expedite the process). I tried a few models I hear about out in the wild, running them through OpenRouter. This let me pay a single source and manage API credits for different models in one place, rather than having to give each separate company a bit of money to try. I had never heard of OpenRouter before, so h/t to Christian Grunenberg on the DEVONThink forums.
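For anyone curious what that single-source setup looks like under the hood, here’s a minimal sketch. OpenRouter exposes one OpenAI-compatible chat endpoint and picks the provider based on the model string in each request; the model identifiers and API key below are placeholders, so check OpenRouter’s model list for the real ones.

```python
import json
import urllib.request

# One endpoint and one key for every provider; the model string does the routing.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_request(api_key: str, model: str, question: str) -> urllib.request.Request:
    """Build an OpenAI-style chat request routed to `model` via OpenRouter."""
    payload = {
        "model": model,  # e.g. "anthropic/claude-sonnet-4" (illustrative name)
        "messages": [{"role": "user", "content": question}],
    }
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # same key for every model
            "Content-Type": "application/json",
        },
    )


# Same question, different models, same credentials and endpoint.
# (Only building the requests here, not sending them.)
for model in ("anthropic/claude-sonnet-4", "x-ai/grok-4", "openai/gpt-5"):
    req = build_request("sk-or-...", model, "What is the Large Group time limit?")
    print(req.full_url, json.loads(req.data)["model"])
```

The point is that swapping models is just swapping a string, which is exactly why one pool of credits covers all four tests.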

It took a bit to figure out initially, but I decided to test the same handful of questions in my ‘Work’ database in DEVONThink and see how they did at answering them. Just out of personal interest, I also tried each one over in my ‘church’ database in DEVONThink. I was less objective here, giving each a different question just to see how they could use the info I’d saved in there effectively.

I decided to share the Markdown output of each chat on Pastebin. You can read the output yourself and judge. I used a few events that are in my area in Iowa, so only local directors would be able to really appreciate the accuracy of the details.

The models I tested were:

  • Claude
  • Grok
  • OpenAI (GPT-5)
  • Gemini

As a disclaimer, I’m talking specifically about how these models interact with DEVONThink and its integrations, not about how they apply to other tasks or how they perform in the abstract. I am commenting on the specific models used in this specific case, and this likely won’t be an accurate indicator for any successor models from these companies.

Claude

Output

I’ve heard a lot about Claude, and I think it’s what my friends who actually work with code use. I, however, had never touched anything Claude-related before this.

I wound up spending the most with this one, but that was probably user error. I was still getting a handle on DT’s AI features and wound up sending a number of requests without being in the right part of my multiple databases.

The timestamps are there for you to evaluate the speed of each sample, but in terms of how it felt in DEVONThink’s interface, it definitely felt the fastest.

Of note, there was one question (the time limits for Large Group) that it got wrong, stating it was 20 minutes. It was the only one to get this wrong. Speed doesn’t count for a lot if accuracy falls short.

Grok

Output

Okay, so I know that, in a field with a lot of ethical concerns, Grok is far and away the most controversial model, and its latest issues are difficult to talk about. This small sample is simply me trying to get perspective on different models.

I will say, it was probably the most consistent about citing the sources in my DEVONThink database. While it had some inaccuracies (which I’ll note below), I think I liked its output the best. I don’t see myself using it, though.

OpenAI

Output

I don’t really know what GPT-5.1 would have gotten me that might have been better; I was just testing GPT-5.

The main thing I noticed was how slow it was. It took 4 minutes to answer my Large Group timing question, and it somehow felt even slower than that. Even if it had been more accurate, it would definitely have been faster for me to find the answers myself.

Gemini

Output

So I don’t know what’s wrong with the integration here, but this one was really bad.

  • It was hardly able to answer anything. When asked about the Large Group time limits, it either just gave up or told me about the State Speech requirements instead.
  • It hallucinated in the most obvious way (there is no “red band” or “blue band” at the NEIBA Honor Band — or any honor band I have ever taken students to that would be mentioned in my database). I guess it’s always better when AI hallucinates in obvious ways, rather than subtle ways, though.
  • It acted as if it were not getting the correct scope, but I’m 90% sure I had things selected the same way I did everything else. It was asking me to select documents that didn’t exist to help it refine its scope.

Broad Notes

Not a single model was able to correctly get the time limits for our Solo & Small Ensemble Festival. Not a single one was entirely correct on my scheduling question for one of the honor bands.

The idea of pointing these models at my information in DEVONThink is really appealing in theory. Giving a model a wide corpus of info to help put together my plans faster or refresh on things would be nice. But if I can’t trust it, then it’s useless. And if I’m paying by the question (all these queries combined came to $1.94 in credits across the services), then I’d like the right answer every time.

I’d been well aware of the limitations of LLMs a while back. But with all the hype, and not doing a lot with them myself, I had begun to assume they were overcoming those limitations in more meaningful ways. With their acceptance in apps I really respect, I was hoping so, at least.

To be fair to everyone, DEVONThink’s aim with this AI is to help find connections and parse research documents in ways beyond what the app has always been able to do. And the searches I ran in my church database proved that it is good at highlighting topics and ideas in broad strokes. But even with over 400 PDFs of information in my work database, it’s not going to surface things I forgot were there to help me find answers; it’s just going to risk giving me wrong ones out of the single PDF I know has the answer.

Ultimately, if I want to save the most time and energy, I’m better off going through the PDFs myself. I’ll be using DEVONThink for that mostly the way I always have.


  1. DEVONThink calls these ‘groups,’ but I’ll just call them folders for the sake of simplicity. 
  2. By which I specifically mean copies of the emails I write and send. I have no place for using AI in my email writing, nor in reading (‘summarizing’) my emails for me.
