Hosts Randy Johnston and Brian Tankersley, CPA, are joined by guest Will Zacher, co-founder of AIgency Partners, to discuss a recent study on how artificial intelligence fared when attempting to pass the CPA exam.
Use the video player below to watch, or the podcast player below to listen to the podcast.
Transcript (Note: There may be typos due to automated transcription errors.)
Brian F. Tankersley, CPA.CITP, CGMA 00:00
Welcome to The Accounting Technology Lab, sponsored by CPA Practice Advisor, with your hosts Randy Johnston and Brian Tankersley.
Randy Johnston 00:12
Today, welcome to The Accounting Technology Lab. I’m your host, Randy Johnston, with co-host Brian Tankersley. And we’re lucky enough to have a guest with us today, Will Zacher, who we discovered with his company, AIgency Partners, had done some things on the CPA exam, with a few findings that we thought were worthwhile. So Brian, I’m just reminded that you’ve taught the CPA exam content for a long time. Just remind our listeners of your history there.
Brian F. Tankersley, CPA.CITP, CGMA 00:41
Yeah. So I started out teaching CPA review. My first teaching job actually was teaching CPA review in January of 1997 with Becker, and I worked with them till about 2013. And then starting about 2015 or ’16, I did some work with Yaeger CPA Review, including going to the CPA exam provider meeting in New York at the AICPA. So I’ve got a long history of spending time with folks figuring out these kinds of problems. And it’s, you know, kind of how I got to this dance.
Randy Johnston 01:19
Yeah, I knew you had some dance experience, if you will. So 25-ish years plus. And so, Will, I’m not going to try to fill in an introduction for you. What would you like our listeners to know about you and your company?
Will Zacher 01:32
Yeah, sure. So our company, AIgency, provides financial automation solutions for CPAs. And our working paper came out of some internal benchmark testing that we were doing. We realized it could be beneficial to the industry, so we decided to publish our findings and make them public.
Randy Johnston 01:49
Yeah, well, I appreciate that. And, you know, what caught my eye on this was that some of your key findings were how ChatGPT and, of course, Claude Opus had passed multiple-choice sections of the CPA exam in both auditing and financial reporting. That was kind of one of the key findings. What should we know about that finding?
Will Zacher 02:17
So for ChatGPT-4, there’s a paper that came out last year in which the researchers were able to modify ChatGPT-4 to get it to pass the CPA exam. But our testing involves what’s called zero-shot prompting, which means that we don’t provide any context to the model around what it’s going to solve. We just copied and pasted the questions in there, and it was able to solve them and, you know, get passing scores on every section without any manipulation by, you know, our team. So both of those models are good choices for asking, you know, general questions that have to do with the CPA exam. ChatGPT-4 performed very well in almost every section. Claude Opus performed extremely well and was best in class for the regulatory section. But both of those models perform well. And then Google Gemini actually performs extremely well in business analytics, which is a relatively newer section of the CPA exam. And one of the major takeaways there is that their model has a larger memory, almost 10x the next model’s memory, in terms of the data that it’s able to digest and use when it’s making decisions in production. So if you’re going to use that in an applied setting, Gemini Pro might be a better solution than either Claude or...
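Editor’s illustration: the zero-shot setup described above, where the question is pasted in with no examples or added context, might be sketched roughly as follows. This is only a sketch, not the study’s actual harness. The function names `build_zero_shot_prompt` and `percent_correct` are hypothetical, the sample question is a placeholder that does not come from the study or from any exam, and the step that sends the prompt to a model is left out on purpose.

```python
from string import ascii_uppercase


def build_zero_shot_prompt(question: str, choices: list[str]) -> str:
    """Return the question and lettered choices with no examples or added context."""
    lines = [question.strip()]
    for letter, choice in zip(ascii_uppercase, choices):
        lines.append(f"{letter}. {choice}")
    return "\n".join(lines)


def percent_correct(model_answers: list[str], answer_key: list[str]) -> float:
    """Score one run as the percentage of answers that match the key."""
    if not answer_key:
        raise ValueError("answer key is empty")
    if len(model_answers) != len(answer_key):
        raise ValueError("answers and key must be the same length")
    hits = sum(
        answer.strip().upper() == key.strip().upper()
        for answer, key in zip(model_answers, answer_key)
    )
    return 100.0 * hits / len(answer_key)


# Hypothetical placeholder question (an assumption for illustration only).
prompt = build_zero_shot_prompt(
    "Which financial statement reports assets, liabilities, and equity at a point in time?",
    [
        "Income statement",
        "Balance sheet",
        "Statement of cash flows",
        "Statement of retained earnings",
    ],
)
print(prompt)
```

The point of the sketch is what is absent: no worked examples, no role instructions, and no reference material are added to the prompt before it goes to the model.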
Randy Johnston 03:41
Makes good sense. Now, Brian and I, again, have followed these AI models since before inception, or before release. And, you know, as we take a look back and think about the ChatGPT release in November of ’22, and then how it all unfolded last year, of course, we thought that Microsoft’s Copilot, when it was released, actually did a better, more accurate job than ChatGPT-4 did at that point in time. And of course, when Bard was updated, now called Gemini, in December, we thought it did better. And when Claude 3 was introduced, we thought it did better. And I’m not going to keep naming models. But you get the idea, friends, that when it comes to accounting-related data, Brian and I are routinely testing these models on responses for queries in different sections. That’s why Will’s team’s results here were interesting, because they were trying to be thoughtful about this. And, you know, as we get to conclusions from large language models into smaller models, which Brian and I have talked to you about in a prior podcast, as we go from large to medium to small to narrow models, they each do different things well, and we’re going to continue to see new announcements like Microsoft’s Phi recently, or the new search engines like Perplexity, which are using these types of models to try to offset older technologies. So it looked to me, Will, like you had actually, you know, considered Mixtral and Llama 2 and some others against the Becker exam when you were getting this result. But, you know, the second major conclusion, I think, is exactly the same one that we’ve been using: that humans have to monitor this AI result, and human accountants are not going to be obsolete. So what would you have us know about that?
Will Zacher 05:39
Yeah, absolutely. So the major application of AI in accounting is probably not going to be, you know, what people talk about with replacing jobs and things like that. It’s unlikely to be that. It’s much more likely to allow for automation of manual work, reconciliation of accounts, and things like that, because current automation tools are much more rule-based, and they’re very rigid, whereas a lot of accounting decisions tend to be a little bit more contextual. And so these models have the capability of translating those contextual requirements into audit automations for, you know, the most time-intensive tasks that really don’t deserve the time of someone who has a CPA or, you know, a CFO- or controller-type role. We think that’s going to be the biggest area of improvement, where people are going to see AI affect accounting the most.
Randy Johnston 06:30
And one of the observations that Brian and I have made is that, generally, the large language models don’t do math well. You know, a lot of people have been trying to solve that problem. And in another Accounting Technology Lab, we talked about Digits, which we think was one of the first products that actually was trying to solve the math problem for reporting. And of course, we’ve talked in multiple other podcasts about for Impact Aid or aid or advisory tools. So, you know, in terms of the, you know, math part of the exam, what do you think your results were showing about the accuracy of number manipulation?
Will Zacher 07:16
Yes, so the number part is more... These models are extremely good at writing Python code, which allows them to sort of get past a lot of the mathematical problems that they might run into. And then obviously, on the CPA exam, the multiple-choice questions aren’t extremely math intensive, so they can be solved with Python. So it’s like that: instead of these models actually interpreting the math and understanding it, they’re writing Python code and then executing that code to be able to solve these problems, which could also explain why some of the open-source tooling might not have that functionality built into it, and therefore struggled, like Llama 2, on some of the more, you know, math-intensive questions.
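Editor’s illustration: as a rough idea of what "writing Python code and then executing it" can look like for an exam-style calculation, here is a minimal sketch. The straight-line depreciation question, the figures, and the function name `straight_line_depreciation` are assumptions chosen for illustration. They are not taken from the study or from any exam question.

```python
from decimal import Decimal


def straight_line_depreciation(cost: str, salvage_value: str, useful_life_years: int) -> Decimal:
    """Annual straight-line depreciation: (cost - salvage value) / useful life."""
    if useful_life_years <= 0:
        raise ValueError("useful life must be a positive number of years")
    annual = (Decimal(cost) - Decimal(salvage_value)) / useful_life_years
    # Round to cents so the result can be compared against answer choices.
    return annual.quantize(Decimal("0.01"))


# Hypothetical inputs: a 50,000 asset, 5,000 salvage value, 5-year life.
annual_expense = straight_line_depreciation("50000", "5000", 5)
print(annual_expense)
```

Running the arithmetic in code, rather than having the model produce the number directly as text, is the workaround being described. A model without a code-execution tool has to do the same calculation in its own output.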
Randy Johnston 08:04
So in any case, a lot of these models, as it turns out, we’ve recommended people use to write scripts, to write code, and so forth. And they are actually being used by bad actors to do a lot of that today. It turns out that I think your third conclusion was that Claude Opus was the strongest in the auditing section. Now, again, I always turn to co-host Brian on this stuff, because he introduces himself as a recovering auditor in many of our conversations, because he’s done both internal and external auditing. But why do you think Claude Opus just did better in auditing compared to ChatGPT-4, which was dominating the financial reporting or regulatory environment?
Will Zacher 09:03
Well, I think the reason that it did better on auditing is because it had access, most likely... So Anthropic is the company that put out Claude Opus, and they’ve got a partnership with Amazon. They probably have larger financial datasets, and therefore have more, you know, training data available that’s related to, you know, audit-related questions. A lot of these sections, all of these models... So if you look at the performance of Claude Opus and ChatGPT-4, it’s easy to compare them against human test takers, but when you look at these models’ performance on things like being able to understand multi-step questions, they also test well for that. So it’s possible that because Claude Opus and ChatGPT-4 are both capable of interpreting multiple-choice questions better than other models, they were able to perform better on these tasks because it was, you know, easier for them to understand. CPA exam materials should be pretty standardized, and therefore the access to true CPA training data for any of these models should be relatively similar across all the LLMs available. And therefore, a lot of the performance differences would come down to things like being able to chunk the questions into something that the model can interpret and then retrieve the right responses for.
Randy Johnston 10:28
Yeah, so it does make sense to me. Of course, I have a bit of a bias personally for the Anthropic methodology, because I liked the rails that they defined. You know, so the constitution or rails of Anthropic seem to be stronger than what ChatGPT is doing. Although, you know, we know that these models continue to leapfrog each other. So perhaps that guided that. I am not sure on that conclusion. But, you know, those were kind of the big three findings. What other things do you believe that you learned along the way of trying to test these models against the CPA exams?
Will Zacher 11:16
Mixtral actually performed extremely well on a lot of the sections. It just had wild amounts of variability. It had by far the most variability, in some cases up to 10%, in terms of its outputs. And so I know that there are some companies that are working on expanding the context window for Mixtral. And the benefits of this could be significant cost savings for people who are using LLMs in production, because Mixtral is open source, it’s free to use, and you can also choose the chip infrastructure that you run it on. So if you wanted to run it on LPUs, like the Groq infrastructure, you’ll get responses much quicker, which could enable you to perform accounting automation actions much faster than the traditional API calls through GPT-4 or Claude Opus. So I think there’s a huge opportunity with Mixtral in getting a model that can, as long as you can pass through the appropriate information to the prompt, you know, reduce the costs of utilizing AI in a production-level environment.
Randy Johnston 12:20
And, you know, Will, good point on the cost and production, because with the discontinuance of the Intel chipsets on May 24 for Gen 13 and the introduction of Gen 14 chipsets, you know, one of the things we’ve been asking all of our professionals to do in their future purchases is look at and adopt neural processing units, in other words, the Intel Core Ultra, to handle some of this AI load in a distributed way. Now, again, you wouldn’t know this about us, but I think as early as 2016, Brian and I began recommending GPUs, dedicated GPUs, knowing that we’d put AI loads on them in the future, and that’s served us pretty well. But as we look at the new Apple M4 chips, and the new phone chips, and so forth, the models are going to wind up running on the edge more. So, you know, I think your point is probably about some of the centralized resources that are available. So Brian, I know you were trying to get in a question, and I’ve kept you pretty quiet today. Sorry about that.
Brian F. Tankersley, CPA.CITP, CGMA 13:34
I was good. Well, I guess, Will, the question I was going to ask is this. Of course, you did this on the legacy CPA exam that just ended recently, because that’s what you had data on. But were there particular subjects or topics within each of the subject areas of the CPA exam that the AI was better or worse at? Or what kinds of trends did you see within those? Like, you know, the regulation piece has tax, and it has PCAOB, and it has a number of other things in it. Talk to me a little bit about how the AI engines handled different areas on the CPA exam.
Will Zacher 14:23
Yeah, definitely. So almost all the AI models we tested on the BAR section of the exam performed exceedingly well. And actually, there was very little differentiation. So to give you an example, ChatGPT-4 scored 89% on average, Gemini was 88%, and Claude Opus was 87%. Even Mixtral, which had no, you know, fine-tuning or prompt engineering or anything, scored an 80% out of the box. And so on business analytics and reporting, you can pretty much use any AI model for that. Whereas regulation by far had some of the highest variability in terms of responses. And that one, really, most AIs struggled with, which was interesting to us, especially because, you know, regulation was the section that we found human test takers seem to do the best on. Overall, there’s a pass rate of about 60%, meaning that, you know, most people are probably scoring in the low 80s on that test. It’s kind of difficult to tell, because we couldn’t find reliable published sources for the scores of each individual person. But our assumption is that they’re relatively normally distributed around the pass rate. So the other area where we saw significant variation between models was audit and financial accounting and reporting. Both of those really separated the models: open-source models really fell off in those two sections, whereas ChatGPT and Claude Opus were able to perform the best in those areas. Let’s see here. Yeah, I mean, those are the other major takeaways. Okay, what else?
Brian F. Tankersley, CPA.CITP, CGMA 16:32
Very interesting. Very interesting. So, you know, we were having a discussion before we got started here. What’s kind of your take on whether or not accounting professionals should be nervous about this, or which ones should be nervous about the launch of AI?
Will Zacher 16:54
I think, actually, accountants should be really excited for this, because there’s nothing worse than getting a list of transactions and datasets and having to go back through and chase down the answers for them. You know, you’ve got these great accounting systems like Xero or QuickBooks that can store data. But the context around the transactions themselves, up until this point, has been stored by people, by employees, and so when people leave the company or whatever, that information is gone. Whereas LLMs can embed that information directly into the transaction database. And so you can get answers to questions much easier and faster this way. So it probably should improve the ability for, you know, accountants to perform audits, or especially fractional CFOs to come in and, you know, hit the ground running when they start with a new company, being able to get the information they need instantly. And then also, the time savings of being able to embed, you know, accounting automations with the context around the transactions, to save time on manual tasks, is going to be fantastic. So you’ll spend more time on modeling, more time on strategy, and much less time on, you know, simple information retrieval.
Randy Johnston 18:16
Very nice. So Will, you know, I know also from our opening conversations that you’re, you know, going back and forth between Buffalo and West Palm Beach, but I suspect it’s because you’re working on your upcoming product, AI agent Donna. Now, it looks like that’s going to do full accounting automation and will integrate with some accounting systems and CRMs and ERPs. So what would you want our listeners to know about Donna?
Will Zacher 18:45
Yeah, so Donna, as it comes out, is going to... We found a lot of people love the AIs, but the AIs can’t really do much for them. So what we’re doing is building out the technical infrastructure to allow accountants and CPAs to automate their transactions with context, so they’ll be able to embed the context directly into their systems. We’re currently in a pilot right now to do some very simple accounting transaction automation, which is just journal entry pattern recognition. But the underlying technology should allow us to automate a lot of the accounting functions through that. And so the way that Donna is going to work is it’ll be able to use the accounting software much like a lower-level employee to perform actions with the context of your business and save, you know, we’re going for, hopefully, hours, days, or weeks of time for, you know, the accounting and finance teams.
Randy Johnston 19:48
All right, so it sounds like this tool should be useful to client accounting services groups, as well as accounting departments within industry businesses, as it’s released.
Will Zacher 19:59
Oh, yeah, definitely.
Brian F. Tankersley, CPA.CITP, CGMA 20:02
And what kind of timetable do you have on this product going into beta or becoming commercially available?
Will Zacher 20:11
We expect it to be commercially available June or July 1. So we’re in testing right now for the next two months, and then we should be able to release at the beginning of Q3.
Randy Johnston 20:25
Wonderful. Well, our guest today has been Will Zacher from AIgency Partners. You know, again, on their study on the CPA exam: we’ve read several of these, and people are in denial, saying, yeah, you can’t get the bar exam or the medical exam or the CPA exam passed. But I think his study basically showed that you can with the legacy exams. We know we’re going to be going into a different time period here, but we wanted to have you hear his personal perspective on the AI models and the possibility. So Will, thank you for your time today. And we wish you great success with your new product, Donna. And, again, hopefully our listeners have picked up an idea or two from you along the way.
Will Zacher 21:14
Yeah, thank you, Randy. Thank you, Brian. Thanks for having me on. Appreciate it. All right, you guys.
Randy Johnston 21:18
Good day. All right.
Brian F. Tankersley, CPA.CITP, CGMA 21:21
Thank you for sharing your time with us. We’ll be back next Saturday with a new episode of The Technology Lab, from CPA Practice Advisor. Have a great week.
= END =