
Navigating an AI project

Saku Suuriniemi

April 12, 2024


With generative AI expanding the possibilities of AI applications, it can be hard to decide what route to follow in what looks like an AI project. This is a problem that Product Owners and their sponsors face increasingly often. There are three broad approaches to choose from: “Data but no AI”, “ready-made tool”, and “own machine learning project”. All have pros and cons.

Even in this new situation, most of the old hard facts and wisdom still hold and guide the way: it is critical to decompose the problem at hand into clear and simple enough subproblems. Appropriate tools must then be chosen for each. I hope this article helps you make the right choices.

Recap: What is AI?

AI is essentially applied statistics. It is important to ask whether statistics is the right tool for the case at hand. At the risk of sounding like a lousy AI evangelist, I must recommend the following principle:

  • If there is first-hand information that is known for certain, use it.

This may mean a low-glamour data acquisition and storage project instead of a shiny AI exploration project, but statistical estimates can never surpass the certainty of first-hand information. It makes sense to invest in making it available and using it first.

So, there is no certainty

There are several situations where one just cannot have certain information when needed. 

  • There is no structured form from which to read clear, unambiguous choices - instead there is lengthy free-form text.
  • Images must be analyzed automatically at a fast pace to classify them correctly.
  • One must predict not-yet-known things, like energy consumption in a house on the next day.

In such cases, actions must be decided from a statistical analysis of the available material. Usually, the amount of material makes a computer the tool to use in the analysis, and much of this activity can be seen as Machine Learning (ML). In fact, most AI somehow involves ML. The principle to remember is:

  • ML needs training data.

“Training data” is a slightly misleading name for annotated data with ground truth. This data is usually a pain point: it may be hard to gather, there may be legal considerations, it must be handled with care, documented with love, etc. Sometimes the required data does not - and even cannot - exist: e.g., credit risk realization cannot be known for cases where no loan is given. Data acquisition projects are tedious. It would be nice to have the data handed over ready and clean, but that seldom happens.
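To make “annotated data with ground truth” concrete, here is a minimal, invented sketch: raw free-form inputs paired with the labels a model would be trained to reproduce. All texts, labels, and the keyword baseline are hypothetical illustrations, not a real pipeline.

```python
# A minimal sketch of annotated data: raw inputs paired with
# ground-truth labels. Every text and label here is invented.
annotated_data = [
    ("Please cancel my subscription immediately.", "cancellation"),
    ("I would like to upgrade to the premium plan.", "upgrade"),
    ("Cancel the service, I no longer need it.", "cancellation"),
    ("How do I move to a bigger plan?", "upgrade"),
]

# A naive keyword rule stands in for a trained model here;
# a real ML model would learn such patterns from the labels.
def classify(text: str) -> str:
    return "cancellation" if "cancel" in text.lower() else "upgrade"

# The labels let us measure performance against the ground truth.
correct = sum(classify(text) == label for text, label in annotated_data)
accuracy = correct / len(annotated_data)
print(accuracy)  # 1.0 on this toy set
```

The point is not the toy classifier but the shape of the data: without the label column there is nothing to train on and nothing to measure against.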

It would be a boon to have a ready-made tool. The first potential hurdle concerns the nature of our problem:

  • Is the problem a niche problem or a general problem?

If the problem is highly specific or involves non-public material, it is a niche problem. In such cases, no one is likely to offer a ready-made tool. On the other hand, if the problem is general enough, there may be a product or a service for it.

  • Is there a ready-made tool for this?

There’s a ready-made tool

This is really nice, because somebody else has acquired the annotated material, trained the model, and possibly even hosts the solution. Sounds blissful.

The upsides are

  • A quick way to first experiments.
  • No up-front investment.
  • Reduced project risk.
  • There might even be performance guarantees.

Mind also the downsides

  • Not free - not necessarily cheap, either.
  • There’s usually little we can do to overcome limitations.
  • May be subtly unsuitable in the end.
  • Hard to reason about the inner workings of the tool.
  • Possible vendor lock-in.
  • Potentially complicated privacy concerns.

Own ML project

If there is no ready-made tool, or if it is inadequate or too expensive to use, a go/no-go decision on an ML project is at hand. The challenge of home-grown ML solutions is:

  • A Machine Learning solution cannot be guaranteed to attain a certain performance with a fixed amount of time and data. Can I tolerate this inherent uncertainty of an ML project?

An expert can assess the problem's difficulty and sometimes help decompose it into simpler subproblems, but problems usually reveal new facets along the way. Is this tolerable?

If yes, let’s think about data acquisition. If annotated data must be prepared specifically for the project, it is hard work.

  • Annotated data is expensive. Am I ready to invest?

Sometimes, however, the data accumulates as a byproduct of business as usual: users click buttons on a web page, experts extract information from documents, and developers write code. This makes annotated data substantially cheaper and lowers the threshold for a “go” decision.
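The byproduct idea can be sketched as follows: each routine expert decision is logged together with its input, so the pair doubles as a labeled training example. The class, method, and example documents below are hypothetical, not an API from any particular system.

```python
# Sketch of collecting annotations as a byproduct of normal work:
# whenever an expert routes a document, we log the (document, choice)
# pair. The expert's routine decision doubles as a ground-truth label.
from dataclasses import dataclass, field

@dataclass
class AnnotationLog:
    examples: list = field(default_factory=list)

    def record(self, document: str, expert_choice: str) -> None:
        # Store the pair in the shape ML training expects.
        self.examples.append({"input": document, "label": expert_choice})

log = AnnotationLog()
log.record("Invoice from ACME, due 2024-05-01", "accounting")
log.record("CV of a frontend developer", "recruiting")

print(len(log.examples))  # 2 labeled examples, at no extra annotation cost
```

The design choice is simply to log decisions people already make; no separate annotation effort is paid for, which is what lowers the threshold.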

However, no matter how the data comes about, it may never suffice to solve the problem. The data accumulation rate may be too low, the quality may be too low, the performance objectives may be too stringent - or the problem may just be plain too hard.

Of course, the solution to the problem must be valuable enough to justify the toil.

The upsides of an own ML project

  • Good control of performance objectives.
  • Good control of training and model choice.
  • Annotated data is sometimes easier to get as a by-product.
  • Easy to maintain good privacy.
  • No vendor lock-in.

Downsides:

  • Uncertainty of attainable performance.
  • Expense of data acquisition and training work.
  • Longer time to first experiments and often also to production.
  • Uncertainty of ROI.
(Infographic: a decision tree for navigating an AI project.)
