A few months back, I took a look at using JSON-LD to turn a recipe web page into pure data: Scraping Recipes Using Node.js, Pipedream, and JSON-LD. That approach relied on a recipe actually using JSON-LD in the header to describe itself, which is pretty common for SEO purposes. Still, I was curious how well generative AI could solve this problem. In theory, this could be a good 'backup' for cases where a site isn't using JSON-LD, and a general exploration of 'parsing' a web page into data. I'll be using Google Gemini again, but this demo should work with other services as well. Here's what I found.
Converting a Web Page into Structured Data
In order to turn a web page into structured data, I needed a few different things. First, remember that Google's Gemini service lets you use JSON Schema to tell the API how to shape its result. (You can find my exploration of that feature here: Using JSON Schema with Google Gemini.)
The code for this isn't difficult, as the schema just becomes part of your request, but crafting it correctly can be a bit of work. As I suggested earlier this year, use the JSON Schema website for help and examples.
As I'm working with recipes, I defined my schema as such:
{
	"description": "A recipe.",
	"type": "object",
	"properties": {
		"name": {
			"type": "string"
		},
		"ingredients": {
			"type": "array",
			"items": {
				"type": "string"
			}
		},
		"steps": {
			"type": "array",
			"items": {
				"type": "string"
			}
		}
	},
	"required": ["name", "ingredients", "steps"]
}
This could be fleshed out more, for example, with a duration property. I also could have attempted to coerce the ingredients into an array of objects containing the name of the ingredient and the quantity. As always, take my blog posts as a starting point, and if you build on this, let me know!
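As a quick illustration of what "becomes part of your request" means, here's roughly how the schema gets wired in when creating the model with the Node SDK. Treat this as a sketch - the model name and variable names are placeholders, not necessarily what my demo uses:

import { GoogleGenerativeAI } from '@google/generative-ai';

const genAI = new GoogleGenerativeAI(API_KEY);

const model = genAI.getGenerativeModel({
	model: 'gemini-1.5-pro',
	generationConfig: {
		// Ask Gemini to return JSON shaped by the schema defined above
		responseMimeType: 'application/json',
		responseSchema: schema
	}
});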
The next issue I ran into was actually getting the HTML. Gemini can't be told to go fetch a URL, but my code can. I initially attempted to take the HTML and simply append it to the prompt, but this caused issues. So I took another approach - simply saving the HTML to a file and uploading it to Gemini for a multimodal prompt. As a reminder, multimodal is just a fancy way of saying "prompt with an associated file or files", and again, I've got a blog post to help you there: Using the Gemini File API for Prompts with Media.
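As for getting that HTML string in the first place, Node's built-in fetch (available in Node 18 and up) handles it in a couple of lines - a quick sketch, with error handling left out:

// Grab the raw HTML for the page; no error handling here
const resp = await fetch(url);
const html = await resp.text();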
Given a string of HTML, here's a simple implementation:
import fs from 'fs';
import { GoogleAIFileManager } from '@google/generative-ai/server';

const fileManager = new GoogleAIFileManager(API_KEY);

// Store to a file temporarily - note the hardcoded path, should be a uuid
fs.writeFileSync('./test_temp.html', html, 'utf8');

const uploadResult = await fileManager.uploadFile('./test_temp.html', {
	mimeType: 'text/html',
	displayName: 'temp html content',
});
const file = uploadResult.file;
As the comment says, you should absolutely not use a hardcoded path, but rather something dynamic like a UUID. My demo doesn't even clean up the file, but I assume that's a trivial change if folks want to use my code (there's a sketch of that below). You may be wondering: can you skip the file system entirely? Unfortunately no, not with the Node SDK. If you switched to the REST API, you could push the content directly, but that's quite a few more steps and probably not worth the effort.
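If you did want to fix both of those issues, something like the following would do it. This is a sketch using Node's built-in crypto.randomUUID and os.tmpdir, not code from my demo:

import fs from 'fs';
import os from 'os';
import path from 'path';
import { randomUUID } from 'crypto';

// Write to a unique temp file instead of a hardcoded path
const tempPath = path.join(os.tmpdir(), `${randomUUID()}.html`);
fs.writeFileSync(tempPath, html, 'utf8');

const uploadResult = await fileManager.uploadFile(tempPath, {
	mimeType: 'text/html',
	displayName: 'temp html content',
});

// Clean up the temp file once the upload is done
fs.unlinkSync(tempPath);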
Next, I designed my system instruction:
const si = `
You are an API that attempts to parse HTML content and find a recipe. You will try to find the name, ingredients, and
directions. You will return the recipe in a JSON object. If you are unable to find a recipe, return nothing.
`;
And my prompt, which is pretty boring:
Given the HTML content, attempt to find a recipe.
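With the file uploaded, the schema defined, and the system instruction in place, the final call is just a multimodal generateContent. Here's a rough sketch of how it all fits together (genAI, schema, si, and file come from the earlier snippets, and the model name is a placeholder):

const model = genAI.getGenerativeModel({
	model: 'gemini-1.5-pro',
	systemInstruction: si,
	generationConfig: {
		responseMimeType: 'application/json',
		responseSchema: schema
	}
});

// Send the prompt along with the uploaded HTML file
const result = await model.generateContent([
	'Given the HTML content, attempt to find a recipe.',
	{
		fileData: {
			mimeType: file.mimeType,
			fileUri: file.uri
		}
	}
]);

// Thanks to the schema, the response text is JSON we can parse directly
const recipe = JSON.parse(result.response.text());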
Using this recipe, https://www.allrecipes.com/recipe/10275/classic-peanut-butter-cookies, here's what I get back:
{
	"ingredients": [
		"1 cup unsalted butter",
		"1 cup crunchy peanut butter",
		"1 cup white sugar",
		"1 cup packed brown sugar",
		"2 large eggs",
		"2.5 cups all-purpose flour",
		"1.5 teaspoons baking soda",
		"1 teaspoon baking powder",
		"0.5 teaspoon salt"
	],
	"name": "Classic Peanut Butter Cookies",
	"steps": [
		"Gather all ingredients.",
		"Beat butter, peanut butter, white sugar, and brown sugar with an electric mixer in a large bowl until smooth; beat in eggs.",
		"Sift flour, baking soda, baking powder, and salt into a separate bowl; stir into butter mixture until dough is just combined. Chill cookie dough in the refrigerator for 1 hour to make it easier to work with.",
		"Preheat the oven to 375 degrees F (190 degrees C). Roll dough into 1-inch balls and place 2 inches apart onto ungreased baking sheets. Flatten each ball with a fork, making a crisscross pattern.",
		"Bake in the preheated oven until edges are golden, about 7 to 10 minutes.",
		"Cool on the baking sheets briefly before removing to a wire rack to cool completely."
	]
}
I put this into a simple web app where you can enter a URL, hit Parse, and get the simpler version. Here's a screenshot of the original, complete page; I cut out about 80% of it and it's still... a lot. Also notice that the actual recipe isn't even displayed in this portion.
Compared to my web app version:
I know which version I prefer. So, if you want to see the full code, you can find everything up at my repo: https://github.com/cfjedimaster/ai-testingzone/tree/main/recipe_scraper. Unfortunately, I can't run this live, but folks are free to take my code and run with it. My code is built to power a web app, but you could just as easily take the core logic and put it in a serverless function instead, as sketched below.
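If you went the serverless route, the core logic boils down to one function that takes a URL and returns the parsed recipe - something like this sketch, where uploadHtml and findRecipe are hypothetical helpers wrapping the code shown above:

// uploadHtml and findRecipe are hypothetical helpers wrapping the earlier snippets
async function parseRecipe(url) {
	// Fetch the page HTML
	const resp = await fetch(url);
	const html = await resp.text();

	// Upload to Gemini and run the multimodal prompt with the JSON schema
	const file = await uploadHtml(html);
	return await findRecipe(file);
}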