Blog Post 5

In my journey of creating an AI, I had to do a lot of research to work out which tools were right for the job. Along the way I ran into an interesting problem: data scraping. Unfortunately, it wasn’t as simple as I had hoped.

I had to learn a few different technologies to try to solve this issue. First I tried a prebuilt Python package called Beautiful Soup, which would have solved my problem had the data I wanted to scrape been presented statically. Because the page builds its content dynamically, I needed to pivot, so for my second attempt I looked into Selenium and Chrome WebDriver. Selenium and Chrome WebDriver are great at scraping dynamic content, but the website I was pulling data from started blocking my requests. So began the search for a way to route my requests through a proxy and get around the blocking. This led me to my final solution, Scrapfly. Scrapfly provides an automatic proxy, and retooling my existing code to work within the platform was relatively easy. Unfortunately, Scrapfly is the only one of these solutions that isn’t free, so I had to seek funding from my project mentor to continue using it after the free trial.
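To make the static versus dynamic distinction concrete, here is a minimal sketch of why a static parser comes back empty on a JavaScript-driven page. It is not my actual scraper: it uses Python's built-in `html.parser` as a stand-in for Beautiful Soup, and the two page snippets, the `prices` table and the `loadRows` call are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page where the data is already present in the HTML.
STATIC_PAGE = """
<html><body>
  <table id="prices"><tr><td>Widget</td><td>9.99</td></tr></table>
</body></html>
"""

# Hypothetical page where JavaScript fills the table after the page loads,
# so the HTML a static scraper downloads contains no rows at all.
DYNAMIC_PAGE = """
<html><body>
  <table id="prices"></table>
  <script>loadRows('/api/prices');</script>
</body></html>
"""


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell found in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self.in_cell:
            self.cells[-1] += data


def extract_cells(html):
    """Return the stripped text of each table cell in the given HTML."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


if __name__ == "__main__":
    print(extract_cells(STATIC_PAGE))
    print(extract_cells(DYNAMIC_PAGE))
```

The parser only ever sees the HTML as downloaded. On the dynamic page the rows do not exist until a browser runs the script, which is why a browser-driving tool such as Selenium was the next thing I tried.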

Google Cloud Platform (GCP) is the main technology I am using to host my AI/ML algorithm. It is a relatively simple platform to get acquainted with and navigate. I had little trouble getting used to hosting files on the platform, and BigQuery's SQL is so similar to MySQL that I had no problems writing queries to navigate my data. Vertex AI, GCP's managed machine learning platform, helps developers build their own AI/ML models, and I am hopeful it won’t be too difficult to become familiar with once all the data has been retrieved and cleaned.

