LM Data Tools is a suite of tools for synthetic data generation that I’ve been developing since I started working with fine-tuning language models back in 2023. What started as a handful of independent Python scripts now shares a web interface (and API) that makes the tools easy to work with. In this post, I will go over what the different tools are, and give an example of how to use one of them.
Fine-tuning language models is most definitely a rabbit hole, but if you’re brand new to the concept, we can summarize it as a way to specialize a pre-trained AI model in some way. There are different approaches and techniques, and twice as many opinions about the best ones to use depending on the desired outcome.
If you’re not already a data scientist or engineer, it can seem overwhelming before you even begin. On that note, a huge shout-out to the Unsloth crew, whose Colab notebooks were my introduction to getting hands-on with fine-tuning. Without them, learning this stuff would have taken much longer. The notebooks gave me a great starting point, and as I learned more about each aspect and parameter involved in the process, I could modify and tweak the notebooks to match.
Before I get too deep into the details, the reason for even building a tool set like this in the first place was that I needed a straightforward way to get topic-specific data without wasting a bunch of time looking through public data sets, hoping to find exactly what I needed. I was learning about fine-tuning LLMs and data synthesis as part of that, and for me, the best way to learn something is to build it. In this case, “it” was data.

Early Version
Despite this being a collection of tools, some of which have been around for a long time (by AI standards, at least), bringing them together was not as easy as it may sound, and there are definitely still bugs to iron out. Even so, I am very happy with it and use it enough that I thought it was time to share LM Data Tools with you.
Tools in the Box
LM Data Tools consists of six core tools and two utilities. The core tools all generate new data, whereas the utilities only modify existing data sets. Every feature is also exposed through FastAPI, in case you want to incorporate any of them into other workflows. Without further ado, let me go over the tools included.
| Tool | Description |
|---|---|
| DataBird | Give it a list of topics and optionally a list of user perspectives, and this tool will generate a number of questions and answers on those topics, from said perspectives. Can generate the perspectives as well. |
| DataPersona | This tool takes a list of prompts (existing data) and applies a persona to the responses. It can write one or two replies per query, and you can keep both or let the built-in evaluation feature pick the better one. The personas from this tool can be imbued into responses in other tools as well. You can add and edit personas directly via the web interface. |
| DataQA | This is a RAG tool. You feed it a list of URLs which are then scraped, and as with DataBird, you can provide specific user perspectives. A number of question and answer pairs are then generated from those perspectives, and based on the sources provided. |
| DataWriter | This tool was specifically made for pre-training purposes and will generate any number of made-up text documents, from blog posts to meeting summaries. The document mix is based on a weighted list of topics. |
| DataConvo | Sometimes you want multi-round conversations to train on. This tool can take any single-round conversation data set and expand it into longer entries. |
| DataThink | This tool can add `<think></think>` blocks to existing data or generate new data that already includes reasoning blocks. If a persona is chosen, it is only applied to the response part, not the reasoning. |
| DataMix | This utility lets you mix and match data sets from Huggingface to make a new, custom data mix. |
| Reformat | Need to convert from alpaca-format to ShareGPT, or the other way around? This utility reformats existing data sets without changing any of the prompts or responses. |
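As a concrete illustration of the two formats Reformat translates between, here is a minimal sketch (not the tool’s actual code) of converting a single alpaca-format record into a ShareGPT-style conversation; the sample entry is made up for the example:

```python
import json

def alpaca_to_sharegpt(entry):
    """Convert one alpaca-format record into a ShareGPT-style conversation."""
    # Alpaca entries may carry extra context in the optional "input" field,
    # which is folded into the human turn here.
    prompt = entry["instruction"]
    if entry.get("input"):
        prompt = f"{prompt}\n\n{entry['input']}"
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": entry["output"]},
        ]
    }

# Hypothetical sample entry, just to show the shape of the data.
alpaca_entry = {
    "instruction": "Summarize the passage in one sentence.",
    "input": "The quick brown fox jumps over the lazy dog.",
    "output": "A fox leaps over a dog.",
}
print(json.dumps(alpaca_to_sharegpt(alpaca_entry), indent=2))
```

Going the other way is just as mechanical, which is why the utility can promise not to change any of the prompts or responses themselves.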
Any Provider via OpenAI API

The tool set uses the OpenAI API to communicate with LLM servers, and you can connect to any provider that supports this API. If you’re running a local AI server, you can simply type in the server IP and use that. Your experience and the quality of the generated data will depend on your model of choice, of course.
I use the tools with LM Studio for locally hosted models and typically fall back on OpenRouter for API usage. OpenRouter often has “stealth models” that can be used for free, and for something like this, free compute is always welcome.
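For reference, talking to an OpenAI-compatible endpoint is just an HTTP POST with a JSON payload. The sketch below builds such a request using only the standard library; the base URL and model name are placeholders for whatever your server exposes (LM Studio defaults to `http://localhost:1234/v1`). The official `openai` Python package works the same way if you point its `base_url` at your server.

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            # Local servers typically ignore the key, but hosted providers need a real one.
            "Authorization": "Bearer not-needed",
        },
    )

# Sending it requires a running server, e.g. LM Studio on its default port:
# with urllib.request.urlopen(chat_request("http://localhost:1234/v1", "local-model",
#                                          "Write one quiz question.")) as resp:
#     reply = json.loads(resp.read())["choices"][0]["message"]["content"]
```

The same request shape works against any provider that implements the API, which is what lets the tool set stay provider-agnostic.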
Example: Text Analysis
Let’s imagine that we want to fine-tune a model to help with text analysis. For this, we will use DataBird with the “Curious Analyst” persona.
Next, we are going to need some topics to base the data set around. You will get the best overall results if you base this list on topics that are already adjacent. For this example, we are going with “finding the meaning and intention behind prose”, “storytelling and plotting”, and “learning from text analysis”.
Let’s also add the following perspectives manually: “an English major struggling with text analysis”, “an author who needs help with their next book”, and “a technical writer who wants to start writing fiction”.
It’s as easy as that. The more topics and perspectives you add, the larger the data set you’ll end up with. After generation and quality evaluation, we are left with a fresh, small data set of 77 entries that can be used for fine-tuning any language model.
The example data set can be downloaded from Huggingface, and the whole LM Data Tools suite is open source on GitHub.