The Alt-Ac Job Beat Newsletter Post 9 2024-02-13

Hi everyone,

Going for a bit of a different format this week. Asking for experience with LLMs (Large Language Models) and generative AI is becoming much more prominent in private sector data science positions. I get why people are excited -- ChatGPT is neat -- but I am not sure whether this is a fad or here to stay. I don't think legitimate business applications of these tools are quite aligned with the increase in job postings I am seeing. (For background, it is pretty common for lay people I work with to make comments like "can't you use AI to do that?"; most people just don't have sufficient background knowledge to really understand predictive models or have a cogent reason to use LLMs.)

But given their prominence I think it is important to at least be familiar with them. And so here are my LLM cliff notes for data scientists.

It is all about the prompts

ChatGPT is a generative AI LLM: you ask a question, "tell me about X", and it generates a response. It is trained on historical text data, using prior words to predict future words. (Deep learning models are not all that different from very large regression or structural equation models; the current large LLMs have billions of parameters. My suggestion for getting your feet wet in deep learning is to implement some regression models you are familiar with in pytorch -- here is an example for group-based trajectory models.)
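To make that concrete, below is a minimal sketch of fitting a plain linear regression in pytorch via gradient descent (simulated data with made-up coefficients, so you can check the model recovers them).

```python
# minimal sketch: fit a linear regression in pytorch via gradient descent
# (simulated data, so the "true" coefficients are known)
import torch

torch.manual_seed(10)
n = 1000
X = torch.randn(n, 2)                      # two predictors
y = 0.5 + 1.0*X[:, 0] - 2.0*X[:, 1] + 0.1*torch.randn(n)
y = y.unsqueeze(1)

model = torch.nn.Linear(2, 1)              # intercept + two slopes
loss_fn = torch.nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=0.05)

for epoch in range(500):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

print(model.bias.data, model.weight.data)  # should be near 0.5, [1.0, -2.0]
```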

So if you ask "Give me an example of using X in a sentence" vs "The definition of X is blah blah blah. Give me an example of using X in a sentence.", you will get two different responses. The additional words in the latter example (the definition of X) are what is called context in LLMs. The popular generative AI models have long context lengths (you can include thousands of additional words in the prompt, and the limits keep growing), so many current LLM techniques boil down to stuffing additional context into the prompt to improve responses. A few to be familiar with are:

RAG (Retrieval Augmented Generation): In these systems, you have a separate retrieval system that finds relevant texts, and you insert those into the prompt. So for example, say your software application has very complicated technical documentation, and you ask "How do I replace widget X in the software with widget Y". You may perform a query on your documentation and insert the applicable pages into the prompt before the question. E.g. "Doc 1: Widget X is .... | Doc 2: Widget Y is .... | Doc 3: How to replace widgets .... | How do I replace ....".

The system that queries the relevant pages is separate from the generative AI LLM, and it often uses semantic search with an embedding model. (I have a few notes on embedding models here.)
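To illustrate the retrieval step, here is a minimal sketch using the sentence-transformers library for the embedding model and plain numpy for the similarity search. The documents and question are made-up toys -- in a real system you would likely use a vector database, and the final stuffed prompt would get sent to whatever generative LLM you are using.

```python
# minimal RAG retrieval sketch: embed docs, find the closest ones to the
# question, and stuff them into the prompt (toy documents, no real LLM call)
import numpy as np
from sentence_transformers import SentenceTransformer

docs = ["Doc 1: Widget X is the legacy widget ...",
        "Doc 2: Widget Y is the replacement widget ...",
        "Doc 3: How to replace widgets ..."]
question = "How do I replace widget X in the software with widget Y?"

emb_model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = emb_model.encode(docs, normalize_embeddings=True)
q_emb = emb_model.encode([question], normalize_embeddings=True)

# cosine similarity (embeddings are normalized, so dot product works)
sims = doc_emb @ q_emb.T
top2 = np.argsort(-sims.flatten())[:2]

context = " | ".join(docs[i] for i in top2)
prompt = f"{context} | {question}"
print(prompt)  # this stuffed prompt is what gets sent to the generative LLM
```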

Zero-Shot Learning: Zero-shot learning is the idea of using a general model to make specific predictions without giving it any labeled examples. So for example, you can ask ChatGPT "Here is a sentence description of a medical narrative, {...narrative here...}, does this narrative suggest the person has a substance abuse problem? Answer only Yes or No".

One-Shot Learning: One-shot learning is the same as zero-shot, you just add examples of actual labeled results into the prompt (strictly speaking, one example is one-shot and several are few-shot). So you could do "Narrative1: ...., Answer Yes | Narrative2: ...., Answer No | Here is a sentence description ..... Answer only Yes or No".
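To show what the prompts themselves look like, here is a small sketch building zero-shot and one-shot prompts for the substance abuse example (the narratives and labels are made up; the resulting string is what gets sent to the model).

```python
# sketch of building zero-shot vs one-shot prompts (toy narratives)
new_narr = "Subject found unresponsive, empty pill bottles at the scene ..."

zero_shot = (f"Here is a medical narrative: {new_narr} "
             "Does this narrative suggest the person has a substance abuse "
             "problem? Answer only Yes or No.")

# one-shot just prepends a labeled example before the same question
example = "Narrative: Subject admitted to daily heroin use ... Answer: Yes"
one_shot = (f"{example} | Here is a medical narrative: {new_narr} "
            "Does this narrative suggest the person has a substance abuse "
            "problem? Answer only Yes or No.")
print(one_shot)
```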

So these are all examples of taking a general model and adapting it to accomplish some specific goal. I have been impressed with RAG (although it is hard to quantify the time savings; in the situations I have seen it used, just building a good system to return the query results is easier and has close to the same benefit). Here is an example of a RAG and zero-shot application taking missing person narratives and classifying risk. (I think that is a good illustration of the idea, but I am quite skeptical that application could reach a level of accuracy to be a legitimate use case. Personally, I would do that with a standard checklist based on 5-10 factors.)

RAG is a response to the idea that "models hallucinate", but it is not a 100% guarantee they won't generate text that is wrong. (Again, these models are just using prior text to predict future text; they are not all that different from a regression model.) Some applications, in a separate step, list the resources used to generate the response, which I think is good -- see PerplexityAI for an example. A common issue with zero- and one-shot is that even if you tell the model "only answer Yes or No", it will still sometimes generate responses that are neither.
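A simple guard for that is to normalize the returned text yourself and flag anything that does not cleanly map to Yes or No, something like this sketch:

```python
# sketch: normalize free-text model answers to Yes/No, flag anything else
def parse_yes_no(response):
    text = response.strip().lower()
    if text.startswith("yes"):
        return "Yes"
    if text.startswith("no"):
        return "No"
    return None  # send these to a human for review, or retry the prompt

print(parse_yes_no("Yes, the narrative suggests ..."))  # Yes
print(parse_yes_no("I am unable to determine that."))   # None
```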

For zero- and one-shot, I have not seen a legitimate use case in a business application (one in which the generative AI would clearly outperform a supervised learning approach). I am sure they exist, I just have not come across one.

If you have historical data, training a model (such as using the text featurizer in CatBoost, or predicting labeled data using old-school smaller models, see Simpletransformers for example) is much simpler and tends to work quite well in my experience. A good example relevant to CJ is categorizing toxic comments in text. Identifying key words and using dummy variables is often not bad either, depending on the scenario.
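As an illustration of the key word/dummy variable approach, here is a minimal sketch with a fixed keyword list and a logistic regression (toy narratives and keywords; with real data you would pick keywords based on domain knowledge and evaluate on held-out labels).

```python
# sketch: keyword dummy variables + logistic regression on labeled text
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "narrative": ["daily heroin use reported",
                  "slipped on ice, fractured wrist",
                  "found with empty pill bottles",
                  "routine follow up, no concerns"],
    "label": [1, 0, 1, 0]})

keywords = ["heroin", "pill", "overdose", "alcohol"]
for kw in keywords:
    df[f"kw_{kw}"] = df["narrative"].str.contains(kw, case=False).astype(int)

X = df[[f"kw_{kw}" for kw in keywords]]
mod = LogisticRegression().fit(X, df["label"])
df["pred_prob"] = mod.predict_proba(X)[:, 1]
print(df[["narrative", "label", "pred_prob"]])
```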

Zero-shot could be useful if you don't know what you are predicting at run time, although I am not familiar with scenarios in which that is the case. (If you wanted to classify substance abuse from text, IMO labeling 20k narratives and using a supervised model is a better strategy than putzing with zero-shot. It may not seem that way at first, but building, evaluating, and iterating on the LLM takes more time than just labeling some data.)

Pains of using LLMs in practice

The majority of applications I am familiar with that use generative AI LLMs rely on OpenAI's API (Application Programming Interface). You send a query over the internet to OpenAI, it processes that query, and it sends back a response.
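For reference, a call to that API looks roughly like the sketch below using the openai python package (the model name is just an example, you need an API key, and again the prompt is sent over the internet to OpenAI's servers).

```python
# rough sketch of a call to the OpenAI chat API (requires an API key set in
# the OPENAI_API_KEY environment variable; the prompt goes over the internet)
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",  # example model name
    messages=[{"role": "user",
               "content": "Summarize the difference between RAG and zero-shot."}],
)
print(resp.choices[0].message.content)
```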

For the majority of business applications in my day to day (healthcare and criminal justice), this is a total non-starter, since sending sensitive data over the internet is not OK. Here are my notes on that.

An alternative is deploying these models locally; Meta has released a somewhat competitive model to the OpenAI models, Llama. I get why so many applications use the OpenAI API though -- these models are so large they are difficult to deploy (they have large GPU RAM requirements even for the smaller models, more like 8/16 gigs of GPU RAM minimum). Many of the examples of running these models use newer Macs, which have a shared GPU/CPU RAM setup, so they can run the larger models.
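If you want to check whether a machine can even hold one of these models, a quick way to see the available GPU RAM with pytorch is below. (As a rough rule of thumb, a 7 billion parameter model at 16-bit precision needs around 14 gigs just for the weights, less if quantized.)

```python
# quick check of local GPU RAM with pytorch (weights for a 7B parameter model
# at 16-bit precision are roughly 14 GB before any quantization)
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, round(props.total_memory / 1024**3, 1), "GB")
else:
    print("No CUDA GPU detected, would need CPU (slow) or a quantized model")
```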

For a good resource on deploying and interacting with LLMs, I suggest following Simon Willison's blog. If you want to test models, a common way is to use a Google Colab online notebook (the Kaggle competition site also has online notebooks). Most of these models are hosted on HuggingFace, and they have minimal getting-started documentation for using the models in python, see Mixtral for example.
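Those getting-started snippets typically look something like the sketch below using the transformers library. I am using the tiny gpt2 model here just so it runs on modest hardware -- swap in the model id from whatever model card you are interested in, keeping in mind the GPU RAM issues above.

```python
# sketch: running a model from HuggingFace with the transformers pipeline
# (gpt2 is tiny and just for illustration; swap in the model id from the
# model card you are interested in, e.g. a Mixtral or Llama variant)
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("The definition of retrieval augmented generation is",
                max_new_tokens=40)
print(out[0]["generated_text"])
```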

Training the large LLMs is very difficult in my experimentation. You can find tutorials online, but they have large GPU RAM and time requirements, and they are not magic (it is easy to burn 40-80 hours putzing with training a model that ends up being junk compared to just out-of-the-box ChatGPT). So "I want to train and run a local model for this vague idea" is a bad idea/big time waster in my experience!

I know I am somewhat of a cynic -- we will see in a year or two if the fad keeps rolling or expectations have come back down. But hopefully these few notes on LLMs can at least help you prep for these jobs in the market!

Best, Andy Wheeler