How to Integrate Non-Deterministic AI Output into Conventional Software Systems
This article explores the challenges of integrating non-deterministic AI components into deterministic software systems, and proposes constraining AI output to achieve better integration, with a focus on domain modeling and strategic design.
AI Components for a Deterministic System (An Example)
by Eric Evans
When we set out to incorporate AI components into larger systems that are mostly conventional software, we encounter various difficulties. How do we wrangle behavior that is intrinsically non-deterministic so that it can be used in structured, deterministic systems? The flexibility of input is great! But the variation of output makes it difficult to do further processing by conventional software.
In this simple example I’ll characterize and constrain a non-deterministic result to make it usable in deterministic software. This leads into domain modeling and strategic design.
What follows isn’t rocket science, but it is the sort of basics I think we need to apply in order to get results.
1. A Question Conventional Code Can’t Easily Answer
Let’s start with a use-case I actually have. When I’m trying to get my bearings in a software system, I usually want to know what domains are addressed and in which parts of the code. So imagine an app that would generate that sort of view of a repo:
To be concrete, let’s look at the open source project “OpenEMR”. Here’s a very small code sample from that project:
We might ask, “what domains are addressed in this code?” Conventional code does not lend itself to that kind of question, but it is a natural use of an LLM.
An intelligent answer! But we couldn’t pass that to conventional software for further processing. Of course, we would instruct the LLM to structure and format its output.
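As a sketch of what that structuring instruction might look like (the prompt wording and JSON field names here are my own assumptions, not taken from the article), we could demand a fixed JSON shape and parse it before handing the result downstream:

```python
import json

def build_domain_prompt(code: str) -> str:
    """Ask for domains as strict JSON so conventional code can parse it.
    The wording and field names are illustrative."""
    return (
        "List the business domains addressed in the following code.\n"
        "Respond with JSON only, in the form:\n"
        '{"domains": [{"name": "...", "evidence": "..."}]}\n\n'
        f"CODE:\n{code}"
    )

def parse_domains(llm_response: str) -> list[str]:
    """Turn the (hopefully well-formed) JSON reply into plain Python data."""
    payload = json.loads(llm_response)
    return [d["name"] for d in payload["domains"]]

# A reply shaped the way the prompt requested:
reply = '{"domains": [{"name": "Patient Billing", "evidence": "invoice fields"}]}'
print(parse_domains(reply))  # ['Patient Billing']
```

In practice the parse step also needs to handle malformed replies, but the point is that the boundary between the LLM and the rest of the system is now a data structure, not free text.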
Okay, so now we have an answer that could be integrated in a technical way. Yet this still will not support the comparisons and hierarchical roll-ups I was hoping for.
Because categories are chosen freely in each run, the classifications of different files will not be easy to compare. To illustrate the point, I’ll repeat the same question using the same file. Every time I ask the question, I get a different answer:
The answers make sense individually but would be difficult to compare or combine.
Modeling Tasks vs Classification Tasks
The stochastic nature of LLMs can be a challenge in making reliable systems. However, in this case, I see it differently.
– Assigning categories is a classification task, which LLMs are good at.
– Creating the categorization scheme is a modeling task, which is fundamentally different.
We are giving the LLM a fairly difficult modeling task: model the relationship of a code sample to various kinds of business activities. It draws on its general knowledge of what software is typically used for and connects that with the language and functionality of the code included in the context. Out of that comes a set of categories that fit this particular code sample, and probably generalizes a bit. But it would be very surprising if the various categories produced in this way fit together.
Classification, to be useful, must be somewhat repeatable. Modeling, when done well, produces a diverse range of possibilities. There are correct and incorrect answers in classification tasks. There is no “correct” model for a domain. If we want consistent categories for different code modules or layers of the hierarchy, we must select a model and use it throughout the process.
2. Create Canonical Categories, Then Classify
Let’s separate the modeling task and the classification task into separate prompts. For the domain modeling task (the creation of the classification scheme), we want to give the LLM a broad view of the whole project. Maximally, we could put all the code in the project into the prompt, along with instructions to make a list of the domains being addressed (similar to the prompt above). Using the entirety of a large codebase in the prompt would be expensive and might exceed the context window. In practice, we would probably get similar results from randomly sampling modules from the project.
However we do it, the generated taxonomy would be different every time we ran this prompt, so we would need to take one output and keep it as a kind of canonical model. Then, this frozen category list could be included in a distinct prompt focused only on classification. This we would run independently for each file or module we wanted to analyze.
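A minimal sketch of that split (the category names and prompt text are hypothetical): the frozen list is ordinary data, and the classification prompt simply embeds it, so every file is judged against the same scheme.

```python
# A frozen category list, captured from one modeling run and kept as data.
# These names are invented for illustration.
CANONICAL_DOMAINS = [
    "Patient Records",
    "Billing and Insurance",
    "Scheduling",
    "Access Control",
]

def build_classification_prompt(code: str, categories: list[str]) -> str:
    """Classification-only prompt: the model may choose from the frozen
    categories but may not invent new ones."""
    listing = "\n".join(f"- {c}" for c in categories)
    return (
        "Classify the following code using ONLY these categories:\n"
        f"{listing}\n"
        "Return the matching category names, one per line.\n\n"
        f"CODE:\n{code}"
    )
```

This prompt would then be run once per file, and because the category list never changes between runs, the answers become comparable.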
This would allow an application to relate different parts of the project or aggregate results hierarchically. Results still couldn’t be compared to an independent run of the software that didn’t use the same frozen categories.
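Once the per-file labels are comparable, the roll-up itself is plain deterministic code. A toy sketch (the paths and labels are invented):

```python
from collections import Counter
from pathlib import PurePosixPath

def roll_up(file_domains: dict[str, list[str]]) -> dict[str, Counter]:
    """Aggregate per-file domain labels into counts for every ancestor
    directory, giving a hierarchical view of where each domain lives."""
    totals: dict[str, Counter] = {}
    for path, domains in file_domains.items():
        for parent in PurePosixPath(path).parents:
            totals.setdefault(str(parent), Counter()).update(domains)
    return totals

# Invented example labels for two files:
labels = {
    "src/billing/invoice.php": ["Billing and Insurance"],
    "src/billing/claims.php": ["Billing and Insurance", "Patient Records"],
}
print(roll_up(labels)["src/billing"])
```

Nothing here requires an LLM; the hard part was getting labels consistent enough for this kind of aggregation to be meaningful.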
3. Incremental Modeling
Up to this point, we’ve assumed that you have all the files at the outset, and that new things fit into the old categories. Sometimes we want our process to be more of a stream, where we can use new code samples to update an existing classification scheme incrementally.
One viable alternative to the all-in-one prompt is to incrementally accumulate a classification scheme by feeding files (or chunks of whatever size) to a prompt something like this:
So we can produce or update a schema automatically and then classify individual modules with it (and redo the classification of previously classified modules).
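A rough sketch of the bookkeeping around such a loop (the prompt text and parsing are my own guesses at one way to do it): feed each new file together with the current scheme, and accept whatever revised scheme comes back.

```python
def build_update_prompt(scheme: list[str], code: str) -> str:
    """Ask the model to extend or revise an existing categorization scheme
    in light of one new code sample. Wording is illustrative."""
    current = "\n".join(f"- {c}" for c in scheme) or "- (none yet)"
    return (
        "Here is the current list of domain categories:\n"
        f"{current}\n"
        "Revise this list if the following code reveals a domain it does not\n"
        "cover. Return the full updated list, one category per line.\n\n"
        f"CODE:\n{code}"
    )

def apply_update(llm_reply: str) -> list[str]:
    """Parse the returned list. If the scheme changed, previously classified
    modules should be re-classified against the new version."""
    return [line.lstrip("- ").strip()
            for line in llm_reply.splitlines() if line.strip()]
```

The re-classification step is the price of incrementality: every revision of the scheme invalidates earlier classifications.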
The results of this, I find, are hit and miss. Sometimes, the categorization scheme is just not very good. Part of that is because modeling is a very difficult task. We probably would want to use the biggest model available to us for this, whereas the classification might be done competently by a smaller model. There is another reason we might not get good results: Even for expert humans, creating good models calls for iteration.
4. Iteration
To get iterative refinement out of a model is a bit more complicated. There are various ways it can be done and people (so far only people!) keep coming up with new ones.
I’ve only played with these techniques. They are very interesting to me! However, for the particular use case I’m focused on here, the approach I’ll describe in the next section gave a better result. We have to have the self-discipline to choose the best solution even when it isn’t the coolest.
I’ll repeat this point: Creating a classification system is a modeling task, which is much harder than the classification task itself and calls for techniques like iteration and critique. There is no one model that such a process would converge on. Rather, we need to be clear about the intended uses for the resulting model. Those goals must be incorporated into the search and selection process. In other words, modeling is still modeling when an AI does it, and many of our techniques apply.
5. Using an Established Standard Model
Although identifying the domain of a software module is a bit niche, the idea of classifying business domains is not. In fact there are multiple well-established standard systems of categories — hierarchies with varying levels of specificity. I considered a few and chose NAICS, using only the top (most general) level.
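For a sense of what the frozen category list looks like in this case, here is a subset of the top-level NAICS sectors embedded in a classification prompt (the codes and titles follow the published NAICS scheme, abbreviated to a few sectors here; the prompt wording is my own):

```python
# A subset of top-level NAICS sectors, used as the frozen category list.
NAICS_TOP_LEVEL = {
    "51": "Information",
    "52": "Finance and Insurance",
    "54": "Professional, Scientific, and Technical Services",
    "62": "Health Care and Social Assistance",
    "92": "Public Administration",
}

def build_naics_prompt(code: str) -> str:
    """Classification prompt pinned to the standard sectors, asking for a
    confidence score so low-confidence answers can be filtered later."""
    sectors = "\n".join(f"{k} {v}" for k, v in NAICS_TOP_LEVEL.items())
    return (
        "Classify this code against these NAICS sectors:\n"
        f"{sectors}\n"
        "For each applicable sector, output: <code> <confidence 0-100>\n\n"
        f"CODE:\n{code}"
    )
```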
At first glance, this doesn’t seem better or even much different from the AI generated categories. However, it offers some significant advantages. These classification schemes have been used widely and shown to be broadly applicable. This fits the “Published Language” pattern from DDD. There is almost always some ambiguity in any classification, but far less in such mature models. As a result, the LLM’s classification output is more consistent. When we run this prompt multiple times on the same file, the results are similar:
Note that there is still variation, but it would be easy to filter out using the confidence level. In this case, the high-confidence categories (say anything above 80%) turned out to be stable enough to dispense with any reconciliation processes. This advantage would come from any comprehensive, low-ambiguity categorization scheme, whether created by humans or by an LLM, but finding such a model is often harder than it looks. Battle-hardened, well-documented, published languages eliminate that task.
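That filtering step is ordinary deterministic code. A sketch with an assumed 80% threshold and made-up run data:

```python
def stable_categories(runs: list[dict[str, float]],
                      threshold: float = 0.8) -> set[str]:
    """Keep only the categories that clear the confidence threshold in every
    run; in the article's case these were stable enough to skip any
    reconciliation process."""
    kept = [{cat for cat, conf in run.items() if conf >= threshold}
            for run in runs]
    return set.intersection(*kept) if kept else set()

# Three made-up runs over the same file: low confidences vary, high ones don't.
runs = [
    {"62": 0.95, "54": 0.55},
    {"62": 0.92, "51": 0.40},
    {"62": 0.97, "54": 0.62},
]
print(stable_categories(runs))  # {'62'}
```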
Using a standard classification also takes away our flexibility to choose our own model. Depending on the application, that may be an unacceptable tradeoff, but watch out for our bias toward believing we need a custom model! In the case of this application, the actual taxonomy of domains is really a “generic subdomain”. The core domain/differentiator is more related to the ability to automatically classify any sort of code, and possibly some aspects of how we look at the hierarchy and roll-up of the smaller parts into the larger parts (for example, if we could recognize context boundaries or track intended boundaries). When a subdomain is generic for an application, it is best to treat it generically and use a standard model whenever possible.
Another advantage of a standard model, and especially a published language, is that it can make it easier to integrate with external systems. (However, that is not a known requirement of this particular application, so in this case I wouldn’t put much value on potential integration.)
Published languages have great advantages! They are worth looking for. Of course, even for fairly common generic subdomains, there are often no mature models available. In that case you’ll have to create your own categories. Using an LLM to do this is an option to seriously consider, but do expect to use a relatively large model, a large context window, perhaps some iterative refinement, and some human review and editing.
If you are truly convinced that the classification is part of your core domain, then, as of 2025, I’d suggest having humans drive the modeling in an exploratory, iterative process such as the ones we’ve talked about for over 20 years. Once these carefully chosen categories are in place, an LLM will probably be a good classifier.