Happy day! Yesterday I discovered a bug in the script I built to search my blog content (see my last post for details. My search script uses Python to parse my blog content, try to find matches to a list of input terms, and return the result. I noticed that one result, when clicked, led to a 404. Why?

My original script used the filename as a way to determine both the date and URL for a post. My Markdown files actually contain both the date and URL in their front matter, but I figured I didn't need to bother parsing that. Turns out, I was wrong. For reasons that are boring and not terribly important, it's possible the day referenced in a filename may not match the day used in the URL.

With that realization I decided to update my search script to support parsing the front matter so I could safely get the URL and date. There's multiple Python libraries for YAML, and I actually started going down that route with the thinking that I'd read in my blog post, split the front matter out, and then parse, when I did a bit more searching and came across python-frontmatter. This is a Python module that will parse a file (or string) of a page that includes front matter and give you both the front matter as a dictionary and the rest of the content as a string. It's perfect! So imagine this source:

---
title: My Cats
categories: ["cats","pets"]
description: Why cats are better than dogs.
---

This is the rest of the content.

After using pip to install python-frontmatter, you can use it like so:

import frontmatter

input = """
---
title: My Cats
categories: ["cats","pets"]
description: Why cats are better than dogs.
---

This is the rest of the content.
"""

data = frontmatter.loads(input)

print("Title", data["title"])
print("Categories", data["categories"])
print("Description", data["description"])

print("Content", data.content)

Keys in front matter can be addressed in dictionary-style in the result and .content can be used to reference the text after the front matter. Here's what the above outputs:

Title My Cats
Categories ['cats', 'pets']
Description Why cats are better than dogs.
Content This is the rest of the content.

Cool. So I began to modify my search script. First, I modified makeIndex to use the module:

def makeIndex(f):
	result = []
	for file in f:
		with open(file) as reader:
			
			content = reader.read()

			# to parse the yaml, we need to get just the front matter
			data = frontmatter.loads(content)

			result.append({
				"content":data.content,
				"date":data["date"],
				"path":data["permalink"]
			})

	return result

You can see it's a heck of a lot simpler than the previous version. The next modification was to the results display. My dates now included times which I didn't really need. I also modified how the URL was printed:

for result in result:
	# create url based on path
	url = "https://www.raymondcamden.com" + result["path"]
	# for printing the date, just need the part before T
	print(result["date"].split("T")[0],"->",url)

Here's sample output:

Example output from the search script.

Check out the docs for more information on what the module can do as I only used what was necessary for my script. The full version of my search script may be found here: https://github.com/cfjedimaster/raymondcamden2020/blob/master/search.py