Converting unstructured text into a structured catalogue is an important step in creating and updating catalogues for buyers on our platform.
To give an example, a seller of an SKU such as sugar writes a short message listing the stock he has available, with the quality and price of each lot. He then shares the message in many WhatsApp groups.
The above is a sample message created by a sugar manufacturer, which he would broadcast to his network. Every time a detail such as price or availability changes, he updates the message and broadcasts it again, until he finds a suitable buyer.
Ninjacart receives thousands of such messages from vendors daily. To enable easy discovery on the platform, we have to create a structured catalogue from these messages in real time, without any human intervention.
The Technical Problem:
In the given sample image, the content is written as unstructured text. To turn it into a catalogue listing in the marketplace, we need it in the form of structured entities, such as variant and price, all linked to each other. The image below shows the structured output we want.
For the given sample text, it looks like a simple regex with some fuzzy matching should solve the problem.
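To make the naive approach concrete, here is a minimal sketch of what "regex plus fuzzy matching" could look like. The message text, the `KNOWN_SKUS` vocabulary, and the `variant : price` line layout are all assumptions made for illustration; they are not real vendor data.

```python
import re
from difflib import get_close_matches

# Hypothetical SKU vocabulary and message, invented for this sketch.
KNOWN_SKUS = ["sugar", "onion", "rice"]
MESSAGE = """Suger
S-30 : 3450
M-30 : 3500
"""

# Assumed layout: "<variant> : <price>" on its own line.
LINE_RE = re.compile(
    r"^\s*(?P<variant>[A-Za-z][\w-]*)\s*[:=]\s*(?P<price>\d+(?:\.\d+)?)\s*$"
)


def match_sku(word, cutoff=0.7):
    """Map a possibly misspelt word to a known SKU, or None."""
    hits = get_close_matches(word.lower(), KNOWN_SKUS, n=1, cutoff=cutoff)
    return hits[0] if hits else None


def parse_message(text):
    """Return one row per variant/price line, tagged with the last SKU seen."""
    sku, rows = None, []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            rows.append(
                {"sku": sku, "variant": m["variant"], "price": float(m["price"])}
            )
        elif line.strip():
            # Not a price line: treat it as a candidate SKU heading.
            sku = match_sku(line.strip()) or sku
    return rows


rows = parse_message(MESSAGE)
```

This works only as long as every vendor writes messages in the one layout the pattern expects.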
But wait, let’s see some more examples:
Now, let us add vernacular content to the mix.
Now, this doesn’t look easy!
Because a single end-to-end ML model is unlikely to solve everything in our case, we have split the problem into several classical NLP tasks, chained together in a pipeline that delivers the end result.
Language Identification:
We use fastText’s language identification module, which works for us without any changes to the existing model.
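As a rough sketch of how this step can be called from Python: the model path below is an assumption (fastText publishes its pre-trained language identification model under the file name `lid.176.bin`), and the helper names `load_language_model`, `clean_label`, and `detect_language` are hypothetical.

```python
def load_language_model(path="lid.176.bin"):
    """Load a pre-trained fastText language identification model.

    The path is an assumption; point it at wherever the model was downloaded.
    """
    import fasttext  # imported lazily so the helpers below work without it

    return fasttext.load_model(path)


def clean_label(label):
    """fastText labels look like '__label__hi'; keep only the language code."""
    return label.replace("__label__", "", 1)


def detect_language(model, text):
    """Return (language_code, confidence) for one message."""
    # predict() handles one line at a time, so flatten multi-line messages.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return clean_label(labels[0]), float(probs[0])


# Usage, assuming the model file is present locally:
#   model = load_language_model()
#   lang, confidence = detect_language(model, "Sugar S-30 available")
```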
Language Translation and Transliteration:
In this problem, we cannot rely on translation alone, because a literal translation often changes the intent of a word.
In one of the examples, “हल्के फुल्के” is a quality variant, but translating it gives “lightly”, which is not the intent of the message. In such cases, transliteration helps us keep the market’s vocabulary intact.
In the same way, we cannot rely on transliteration alone, because we need to understand the content to derive meaningful entities from it. To give an example, “बुधवार” is a day of the week (Wednesday), so translating this word tells us the day on which the trade is happening. Even the prices are sometimes written in local languages.
So, deciding which text to translate and which to transliterate is another active problem that we are working on.
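The simplest way to picture the split is a pair of lookup tables, one for phrases to transliterate and one for words to translate. The sketch below is only an illustration of the idea using the two examples above; the table contents, the romanisation "halke phulke", and the function name `normalise` are assumptions, and a fixed lexicon like this is a stand-in for the open problem, not a solution to it.

```python
# Hypothetical lexicons built from the two examples in the text.
TRANSLITERATE = {"हल्के फुल्के": "halke phulke"}  # market vocabulary: keep the sound
TRANSLATE = {"बुधवार": "Wednesday"}  # general vocabulary: keep the meaning


def normalise(text):
    """Replace known phrases; anything not in a lexicon is left untouched."""
    # Transliterate first so multi-word market terms are not broken up.
    for phrase, latin in TRANSLITERATE.items():
        text = text.replace(phrase, latin)
    for word, english in TRANSLATE.items():
        text = text.replace(word, english)
    return text


result = normalise("बुधवार हल्के फुल्के")
```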
Tokenisation With Visual Structure Intact:
This piece of the pipeline is trickier than it first appears. The foremost thing to consider here is the visual structure of the text: we need to add unique tokens for the visual cues that signal relationships between entities. We will expand more on this in part 2 of this blog post.
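Until then, here is a minimal sketch of the general idea, under stated assumptions: the marker tokens `[NL]` and `[INDENT]` are hypothetical names, line breaks and leading whitespace are assumed to be the cues worth keeping, and the input is assumed to be Latin-script text. It is not the tokeniser used in the pipeline.

```python
import re

# Hypothetical marker tokens for two visual cues.
NEWLINE, INDENT = "[NL]", "[INDENT]"


def tokenize_with_layout(text):
    """Split text into word and punctuation tokens, keeping layout markers."""
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines in this sketch
        if line[0] in " \t":
            tokens.append(INDENT)  # the line is visually nested under the previous one
        # Words and numbers, or any single non-space symbol.
        tokens.extend(re.findall(r"\w+|[^\w\s]", line))
        tokens.append(NEWLINE)  # the line break itself becomes a token
    return tokens


tokens = tokenize_with_layout("Onion\n  Nashik: 20")
```

A downstream model can then see that "Nashik" sits on an indented line below "Onion", information that plain whitespace tokenisation would discard.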
Named Entity Recognition:
This is a classical Named Entity Recognition approach, where we train a RoBERTa model on our own dataset with custom entities. The entities are specific to our problem statement: SKU, variation within the SKU, price, and many others.
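A token-classification model of this kind outputs one tag per token, and those tags still have to be collapsed into entity spans. The sketch below shows that post-processing step, assuming the common BIO tagging scheme; the scheme, the label names `SKU`, `VARIANT`, and `PRICE`, and the example tokens are assumptions for illustration, not details of the production model.

```python
def decode_bio(tokens, tags):
    """Collapse per-token BIO tags into a list of {label, text} entities."""
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that was open.
            if current:
                entities.append(current)
            current = {"label": tag[2:], "text": token}
        elif tag.startswith("I-") and current and current["label"] == tag[2:]:
            # Continuation of the entity that is currently open.
            current["text"] += " " + token
        else:
            # "O" tag, or an "I-" tag with no matching open entity.
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities


entities = decode_bio(
    ["Onion", "Nashik", "red", "20"],
    ["B-SKU", "B-VARIANT", "I-VARIANT", "B-PRICE"],
)
```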
Entity Link Prediction:
Once the entities are identified, extracting the relationships between them is a complicated problem. For instance, if the text contains two varieties of onion, each variety must be matched with its own price. We will expand more on this in part 2 of this blog post.
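To show what "linking" means in practice, here is a deliberately naive baseline: pair each variant with the nearest price on the same line. This heuristic, the `line` and `pos` fields, and the sample entities are assumptions for illustration; it is a reference point, not the link prediction approach described in this post.

```python
def link_variants_to_prices(entities):
    """Pair each VARIANT with the closest PRICE found on the same line."""
    prices = [e for e in entities if e["label"] == "PRICE"]
    links = []
    for variant in (e for e in entities if e["label"] == "VARIANT"):
        same_line = [p for p in prices if p["line"] == variant["line"]]
        if same_line:
            nearest = min(same_line, key=lambda p: abs(p["pos"] - variant["pos"]))
            links.append((variant["text"], nearest["text"]))
    return links


# Hypothetical NER output: two onion varieties, one per line.
sample = [
    {"label": "VARIANT", "text": "Nashik", "line": 1, "pos": 0},
    {"label": "PRICE", "text": "20", "line": 1, "pos": 2},
    {"label": "VARIANT", "text": "Bellary", "line": 2, "pos": 0},
    {"label": "PRICE", "text": "18", "line": 2, "pos": 2},
]
links = link_variants_to_prices(sample)
```

The baseline breaks as soon as a price sits on a different line from its variant or is shared by several variants.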
In the obvious cases, the SKU keyword is present in the content. But in many cases, the content does not say which SKU its words refer to. For these, we build a classification model that predicts the SKU from all the keywords in the message.
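As a toy illustration of predicting an SKU from keywords alone, the sketch below counts how often each keyword appeared with each SKU in labelled examples and lets the keywords of a new message vote. The training examples and function names are hypothetical, and keyword voting is a simple stand-in for whatever classifier is actually trained.

```python
from collections import Counter, defaultdict


def train(examples):
    """Count keyword/SKU co-occurrences from (keywords, sku) pairs."""
    counts = defaultdict(Counter)
    for keywords, sku in examples:
        for keyword in keywords:
            counts[keyword.lower()][sku] += 1
    return counts


def predict(counts, keywords):
    """Return the SKU with the most keyword votes, or None if nothing matches."""
    votes = Counter()
    for keyword in keywords:
        votes.update(counts.get(keyword.lower(), {}))
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical labelled data: the message never names the SKU itself.
model = train([
    (["S-30", "M-30"], "sugar"),
    (["nashik", "red"], "onion"),
])
guess = predict(model, ["m-30", "3500"])
```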
For data annotation, we use a modified version of doccano internally. All the tasks mentioned above have their own custom annotations, powered by doccano.
We will try to expand on each module in the next part of this blog post. Apart from this, we have a few NLP problem statements around de-duplication and around building an end-to-end custom transformer model to cater to every customized problem we have.
Do our unique problem statements sound interesting to you?
Join Ninjacart and work with us to solve for the future of Agritech. Please send your resume to email@example.com with the subject line “Data Science: NLP” to explore opportunities in Data Science.