farhaan: File Indexing In Golang

I have been working on a pet project to write a File Indexer, a utility that helps me search a directory for a given word or phrase.

The motivation behind building this utility was to be able to search the chat log files for dgplug. We have a lot of online classes and guest sessions, and at times we only remember a name or a phrase used in a class; tracking down the right log file from that alone is not possible as of now. I thought I would take a stab at this problem, and since I am trying to learn golang, I implemented my solution in it. The work took about two weeks, part of which I spent upskilling on certain aspects and arriving at a clean solution.

Exploration

This started with looking for an existing solution, because why not? It is always better to improve an existing solution than to write your own. I didn’t find any that suited our needs, so I ended up writing my own. The search did lead me to a few libraries that could be useful to us: fulltext and Bleve.

I found bleve to have better documentation and a really beautiful idea behind it. The library is designed with a minimal yet effective thought process. By the end I was sure I was going to use it, and there was no going back.

Working On the Solution

After all the exploration I broke the problem into smaller problems and solved them one at a time. The first was to understand how bleve works. I found out that bleve first creates an index, for which we need to give it the list of files. Behind the scenes the index is basically a map-like structure: you give it an id and the content to be indexed under that id. So what could serve as a unique id for a file in a filesystem? The path of the file. I used the path as the id and the content of the file as the value.
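
To make the id-to-content idea concrete, here is a toy sketch in Go. It is not how bleve is implemented; it only assumes a plain map from each word to the ids (file paths) that contain it. The names toyIndex, add and lookup and the sample paths are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// toyIndex maps each lower-cased word to the set of document ids containing it.
// Illustration only: bleve's real index is far more sophisticated.
type toyIndex map[string]map[string]bool

// add indexes content under id. For the file indexer the id is the file path.
func (ix toyIndex) add(id, content string) {
	for _, word := range strings.Fields(strings.ToLower(content)) {
		if ix[word] == nil {
			ix[word] = make(map[string]bool)
		}
		ix[word][id] = true
	}
}

// lookup returns the sorted ids of all documents containing word.
func (ix toyIndex) lookup(word string) []string {
	var ids []string
	for id := range ix[strings.ToLower(word)] {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	ix := toyIndex{}
	ix.add("logs/day1.txt", "welcome to the golang session")
	ix.add("logs/day2.txt", "guest session on packaging")
	fmt.Println(ix.lookup("session"))
	fmt.Println(ix.lookup("golang"))
}
```

Because the path is the key, indexing the same file again simply updates its entries instead of creating a duplicate document.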

After figuring this out I wrote a function which takes the directory as the argument and gives back the path and the content of each file. After a few iterations of improvement it split into two functions: one is responsible for getting the paths of all the files, and the other just reads a file and gets its content out.

// fileNameContentMap walks the configured root directory and returns one
// FileIndexer (path plus content) per file found.
// Needs the "os" and "path/filepath" imports.
func fileNameContentMap() []FileIndexer {
	var files []string
	var fileIndexer []FileIndexer

	err := filepath.Walk(config.RootDirectory, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err // info can be nil when err is set, so bail out first
		}
		if !info.IsDir() {
			files = append(files, path)
		}
		return nil
	})
	checkerr(err)
	for _, filename := range files {
		// The file path doubles as the unique id of the document.
		content := getContent(filename)
		fileIndexer = append(fileIndexer, FileIndexer{Filename: filename, FileContent: content})
	}
	return fileIndexer
}

This forms a struct which stores the name of the file and the content of the file. And since I can have many files, I need a slice of that struct. This is how a simple data structure evolves into a more complex one.
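
The split into one function that lists paths and one that reads content, described earlier, can be sketched with the standard library alone. The real body of getContent is not shown in this post, so the version here is an assumption, as are listFiles, writeDemoTree and the sample file. Both functions return errors instead of calling checkerr so that the sketch stays self-contained.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"path/filepath"
)

// listFiles returns the path of every regular file below root.
func listFiles(root string) ([]string, error) {
	var files []string
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err // info may be nil here, so stop before touching it
		}
		if info.Mode().IsRegular() {
			files = append(files, path)
		}
		return nil
	})
	return files, err
}

// getContent reads one file fully. The deferred Close releases the file
// handle even when the read fails.
func getContent(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	data, err := io.ReadAll(f)
	if err != nil {
		return "", err
	}
	return string(data), nil
}

// writeDemoTree creates a throwaway directory holding one sample file and
// returns its path plus a cleanup function. Purely hypothetical demo data.
func writeDemoTree() (string, func(), error) {
	root, err := os.MkdirTemp("", "indexer-demo")
	if err != nil {
		return "", nil, err
	}
	cleanup := func() { os.RemoveAll(root) }
	err = os.WriteFile(filepath.Join(root, "hey.txt"), []byte("hello dgplug"), 0o644)
	if err != nil {
		cleanup()
		return "", nil, err
	}
	return root, cleanup, nil
}

func main() {
	root, cleanup, err := writeDemoTree()
	if err != nil {
		log.Println(err)
		return
	}
	defer cleanup()

	files, err := listFiles(root)
	if err != nil {
		log.Println(err)
		return
	}
	for _, name := range files {
		content, err := getContent(name)
		if err != nil {
			log.Println(err)
			continue
		}
		fmt.Printf("%s: %q\n", filepath.Base(name), content)
	}
}
```

Keeping the two steps apart means the listing can be reused on its own, for example to check which files exist without reading them.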

Now I have utilities for getting all the files, getting the content of each file and making an index.

This is a crucial step towards what we are going to achieve next.

How Do I Search?

Now that I was able to prepare my data, the next logical step was to retrieve search results. We search for something by passing a query, so I stubbed out a function which accepts a string and then went on a documentation spree to find out how to search in bleve. I found a simple implementation which returns the id of each matching file (its path) and a match score.

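// searchResults opens the bleve index at indexFilename and runs searchWord
// against it as a query string. It assumes checkerr aborts on a non-nil error.
func searchResults(indexFilename string, searchWord string) *bleve.SearchResult {
	index, err := bleve.Open(indexFilename)
	checkerr(err) // a failed Open returns a nil index, and Close on it would panic
	defer index.Close()
	query := bleve.NewQueryStringQuery(searchWord)
	searchRequest := bleve.NewSearchRequest(query)
	searchResult, err := index.Search(searchRequest)
	checkerr(err)
	return searchResult
}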

This function opens the index, searches for the term and returns the results.

Let’s Serve It

After all that was done I needed a service which does this on demand, so I wrote a simple API server which has two endpoints, index and search. The way mux works is that you give it an endpoint and the handler function to be mapped to it. I had to restructure the code in order to make this work. I also faced a very crazy bug which, when I narrowed it down, turned out to be a memory leak, and yes, it was because I had left the file read stream open. So remember: when you Open, always defer Close.
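
A rough sketch of such a two-endpoint server follows. It swaps in the standard library's http.ServeMux for the mux router used in the project, and the reindex and search callbacks, the q query parameter and the port mentioned in the comment are all assumptions for illustration. Requests are sent in memory through httptest so that the example runs without opening a port.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newRouter maps the two endpoints to their handlers. reindex and search are
// hypothetical callbacks standing in for the bleve-backed functions.
func newRouter(reindex func() error, search func(q string) []string) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/index", func(w http.ResponseWriter, r *http.Request) {
		if err := reindex(); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		fmt.Fprintln(w, "indexed")
	})
	mux.HandleFunc("/search", func(w http.ResponseWriter, r *http.Request) {
		q := r.URL.Query().Get("q")
		if q == "" {
			http.Error(w, "missing q parameter", http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(search(q))
	})
	return mux
}

// get performs an in-memory GET against h and returns status code and body.
func get(h http.Handler, target string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	router := newRouter(
		func() error { return nil }, // pretend re-indexing always succeeds
		func(q string) []string { return []string{"logs/some/hey.txt"} },
	)
	code, body := get(router, "/search?q=hello")
	fmt.Println(code, body)
	// A real service would instead call http.ListenAndServe(":8080", router).
}
```

Passing the indexing and search logic in as functions keeps the HTTP layer separate from bleve, which also makes the handlers easy to exercise without a real index.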

I used Postman to test it heavily and it was returning good responses. A dummy response looks like this:

 [{"index":"irclogs.bleve","id":"logs/some/hey.txt","score":0.6912244671221862,"sort":["_score"]}]
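
If a Go client needs to consume this response, a struct mirroring the visible fields is enough. The sketch below assumes only the four fields shown in the sample; the hit type and the parseHits function are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hit mirrors the fields visible in the sample response.
type hit struct {
	Index string   `json:"index"`
	ID    string   `json:"id"` // the file path used as document id
	Score float64  `json:"score"`
	Sort  []string `json:"sort"`
}

// parseHits decodes a JSON array of hits.
func parseHits(body []byte) ([]hit, error) {
	var hits []hit
	err := json.Unmarshal(body, &hits)
	return hits, err
}

func main() {
	body := []byte(`[{"index":"irclogs.bleve","id":"logs/some/hey.txt","score":0.6912244671221862,"sort":["_score"]}]`)
	hits, err := parseHits(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, h := range hits {
		fmt.Printf("%s (score %.3f)\n", h.ID, h.Score)
	}
}
```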

Missing Parts?

The first missing part was that I didn’t use any dependency manager, which Kushal pointed out to me, so I ended up using dep to do this for me. The next one was the best problem: how do I auto-index a file? Suppose my service is running and I add one more file to the directory; this file’s content wouldn’t come up in the search because the indexer has not run on it. This was a beautiful problem, and I approached it from many different angles. First I thought I would re-run the service every time I add a file, but that’s not a graceful solution. Then I thought I would write a cron job which pings /index at regular intervals, and yet again that was a bad option. Finally I wondered whether I could detect changes to the files. This led me to explore gin, modd and fresh.

Gin was not very compatible with mux, so I didn’t use it. Modd was very nice, but I need to kill the server to restart it, since two services cannot run on a single port, and every time I kill that service I kill the modd daemon too, so that possibility also got ruled out.

Finally, the best solution was fresh, although I had to write a custom config file to suit the requirement. It still has issues with nested repository indexing, which I am still working out how to solve.
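
Independent of any particular tool, detecting changes can be sketched by comparing file modification times between two scans. This is not how fresh or modd work internally; snapshot, changed and the demo file name are made up for illustration. filepath.Walk descends into subdirectories, so nested files are included in each scan.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"sort"
)

// snapshot maps every regular file below root to its modification time
// in nanoseconds since the Unix epoch.
func snapshot(root string) (map[string]int64, error) {
	seen := make(map[string]int64)
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.Mode().IsRegular() {
			seen[path] = info.ModTime().UnixNano()
		}
		return nil
	})
	return seen, err
}

// changed returns, sorted, the paths that are new in cur or whose
// modification time differs from the one recorded in prev.
func changed(prev, cur map[string]int64) []string {
	var out []string
	for path, mod := range cur {
		if old, ok := prev[path]; !ok || old != mod {
			out = append(out, path)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	root, err := os.MkdirTemp("", "watch-demo")
	if err != nil {
		log.Println(err)
		return
	}
	defer os.RemoveAll(root)

	before, err := snapshot(root)
	if err != nil {
		log.Println(err)
		return
	}
	// Simulate a new log file arriving while the service is running.
	if err := os.WriteFile(filepath.Join(root, "new.txt"), []byte("hi"), 0o644); err != nil {
		log.Println(err)
		return
	}
	after, err := snapshot(root)
	if err != nil {
		log.Println(err)
		return
	}
	for _, path := range changed(before, after) {
		fmt.Println("needs indexing:", filepath.Base(path))
	}
}
```

A service could call snapshot on a timer and re-index only the paths returned by changed, at the cost of noticing new files one interval late.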

What’s Next?

This project is yet to be containerised and there are missing test cases, so I will be working on them as and when I get time.

I have learnt a lot of new things about the filesystem and how it works because of this project. It helped me appreciate a lot of golang concepts and made me realise the power of static typing.

If you are interested you are welcome to contribute to file-indexer. Feel free to ping me.

Till then, Happy Hacking!

 


Source From: fedoraplanet.org.
