For the past few weeks, my team and I have been working on programmatically converting a giant Word (.docx) document into JSON and then using that data in the investormatch.ai application. Initially, I thought converting the Word document into JSON would take a few days at most and be relatively straightforward. Several roadblocks delayed our progress, but I think all of them could have been solved with better communication as a team.
First, I tried to convert the .docx directly into JSON. Unfortunately, it turns out that parsing a Word document is extremely complex without an external library (a .docx file is a ZIP archive of XML parts, and the markup inside is a mess). I then tried using an npm package to convert the .docx directly to JSON. This attempt also failed because of the formatting of the Word document. The document consists of hundreds of unordered lists nested to arbitrary depth (lists within lists). Unfortunately, when data was added to the document (before we started our Capstone), some of the indentation ended up wrong (over- or under-indented), some lists were missing bullet points altogether and were simply indented under their parent, and some lists were numbered rather than bulleted. I then resorted to converting the Word document into an HTML file first and converting that HTML to JSON. I used an online .docx-to-HTML converter. The result was not perfect either, because of the same formatting issues listed above, but it worked better than parsing the raw .docx, even with an external library.
Once I had the HTML file (mostly) in the right format, parsing it into JSON was relatively straightforward, using recursion to handle any number of nested children/lists. Unfortunately, it turned out that all three members of the team had worked on this same task for the better part of a week, so our progress was limited. We met as a team and agreed to communicate better and to plan who would work on what, so that we no longer duplicate tasks and can make more progress.
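The recursive parsing step mentioned above can be sketched roughly as follows. This is a minimal illustration in Python, not our actual code: it assumes the HTML has already been cleaned up enough to be well-formed markup, the function name `list_to_nodes`, the `text`/`children` keys, and the sample list are all made up for the example, and it treats numbered (`ol`) and bulleted (`ul`) lists the same way.

```python
import json
import xml.etree.ElementTree as ET


def list_to_nodes(list_el):
    """Recursively turn a <ul>/<ol> element into a list of nested dicts."""
    nodes = []
    for li in list_el.findall("li"):
        children = []
        for sub in li:
            # A list nested inside this item becomes its children.
            if sub.tag in ("ul", "ol"):
                children.extend(list_to_nodes(sub))
        # li.text is the text that appears before the first nested element.
        nodes.append({"text": (li.text or "").strip(), "children": children})
    return nodes


# Hypothetical sample; the real document's contents are not shown here.
sample = (
    "<ul>"
    "<li>Investors<ul><li>Seed</li>"
    "<li>Series A<ol><li>Lead</li></ol></li></ul></li>"
    "<li>Founders</li>"
    "</ul>"
)

tree = list_to_nodes(ET.fromstring(sample))
print(json.dumps(tree, indent=2))
```

Because the function calls itself for every nested list, the depth of nesting does not matter. HTML produced by a converter is often not well-formed, so a more forgiving HTML parser would likely be needed in practice.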
Since agreeing to split up the workload, we have been working on the next steps of our project and making more progress. I am working on a script that soft-deletes the existing data in the database, updates the user data related to the JSON blob, and then inserts the new data into the database (ensuring no user data is lost in the process). My teammates are working on safely adding a large new list of data, which our PM just provided in a slightly different format from the existing data, and on adding a user interface (GUI) so that users can interact with this data in a meaningful way.
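To give a feel for what a soft-delete-and-replace script like this might look like, here is a minimal sketch using Python's built-in `sqlite3` module. Everything specific in it is an assumption for illustration: the `items` and `user_items` tables, the `deleted_at` column, the `replace_items` function, matching old and new rows by name, and the order of steps inside the transaction. It is not our real schema or script.

```python
import sqlite3


def replace_items(conn, new_names):
    """Soft-delete current items, insert new ones, and repoint user rows."""
    with conn:  # one transaction: commits on success, rolls back on error
        # Soft delete: mark rows instead of removing them.
        conn.execute(
            "UPDATE items SET deleted_at = CURRENT_TIMESTAMP "
            "WHERE deleted_at IS NULL"
        )
        for name in new_names:
            cur = conn.execute("INSERT INTO items (name) VALUES (?)", (name,))
            new_id = cur.lastrowid
            # Move user references from soft-deleted rows with the same name.
            conn.execute(
                "UPDATE user_items SET item_id = ? WHERE item_id IN "
                "(SELECT id FROM items "
                " WHERE name = ? AND deleted_at IS NOT NULL)",
                (new_id, name),
            )


# Hypothetical schema and data, created in memory for the example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT);
    CREATE TABLE user_items (user_id INTEGER, item_id INTEGER);
    INSERT INTO items (name) VALUES ('Seed'), ('Series A');
    INSERT INTO user_items VALUES (1, 1), (2, 2);
    """
)

replace_items(conn, ["Seed", "Series B"])
links = conn.execute(
    "SELECT user_id, item_id FROM user_items ORDER BY user_id"
).fetchall()
active = conn.execute(
    "SELECT name FROM items WHERE deleted_at IS NULL ORDER BY id"
).fetchall()
```

In this sketch, a user row whose item has no match in the new data keeps pointing at the soft-deleted row, so nothing is lost, and wrapping all steps in one transaction means a failure partway through leaves the database unchanged.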
This process has taught us all the importance of effective communication for making meaningful progress on our project, especially since the tasks we are tackling are more challenging than we had initially thought. We communicate primarily over Slack and have weekly Zoom meetings with our PM. I am looking forward to seeing what progress we will have made by the end of the week.