{"id":5,"date":"2021-12-05T17:01:06","date_gmt":"2021-12-05T17:01:06","guid":{"rendered":"https:\/\/blogs.oregonstate.edu\/scrapy\/?p=5"},"modified":"2021-12-05T17:01:06","modified_gmt":"2021-12-05T17:01:06","slug":"introduction-to-scrapy","status":"publish","type":"post","link":"https:\/\/blogs.oregonstate.edu\/scrapy\/2021\/12\/05\/introduction-to-scrapy\/","title":{"rendered":"Introduction To Scrapy"},"content":{"rendered":"\n<p>Hello! This blog will introduce you, the potential new user, to Scrapy.<\/p>\n\n\n\n<p>Scrapy is an open-source project written in Python for web crawling and web scraping. I have personally used it to scrape data in bulk and create price and in-stock alerts for certain items I wish to buy. I have found Scrapy easy to use and generally a solid open-source project to support. The one thing I wanted, and would still like to see, is documentation in video form. I personally learn better with the video format and have created this blog and accompanying videos to help others in the same boat.<\/p>\n\n\n\n<p>Official Scrapy: https:\/\/github.com\/scrapy\/scrapy<\/p>\n\n\n\n<p>Official Scrapy Documentation: https:\/\/docs.scrapy.org\/en\/latest\/<\/p>\n\n\n\n<p>Part 1 &#8211; Install and run your first scrape:<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Scrapy Tutorial Pt 1\" width=\"750\" height=\"422\" src=\"https:\/\/www.youtube.com\/embed\/tjDyEypbohA?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<p>If you followed the video above correctly, you should see two new files created: quotes-1.html and quotes-2.html. Notice that both output files are HTML files. 
In the next section, we will move on to data extraction.<\/p>\n\n\n\n<p>Part 2 &#8211; Extracting data to a JSON file:<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Scrapy Tutorial Pt 2\" width=\"750\" height=\"422\" src=\"https:\/\/www.youtube.com\/embed\/iUxnZWZnpmY?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<p>From here, you could feed the JSON file into an alert system or any other program to suit your needs.<\/p>\n\n\n\n<p>But wait, what if you want to extract data across many pages without listing every URL in start_urls?<\/p>\n\n\n\n<p>Part 3 &#8211; How to extract data recursively:<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Scrapy Tutorial Pt 3\" width=\"750\" height=\"422\" src=\"https:\/\/www.youtube.com\/embed\/aKn8WsQT8Tg?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<p>This approach is useful for extracting data from sites with many pages, such as government websites with 50+ pages.<\/p>\n\n\n\n<p>As you can see, Scrapy is an easy-to-use tool for web scraping and data extraction. I hope you consider Scrapy for your next project. Please visit the official repository and documentation linked above for more information!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hello! This blog will introduce you, the potential new user, to Scrapy. 
Scrapy is an open-source project written in Python for web crawling and web scraping. I have personally used it to scrape data in bulk and create price and in-stock alerts for certain items I wish to buy. I have found&hellip; <a class=\"more-link\" href=\"https:\/\/blogs.oregonstate.edu\/scrapy\/2021\/12\/05\/introduction-to-scrapy\/\">Continue reading <span class=\"screen-reader-text\">Introduction To Scrapy<\/span><\/a><\/p>\n","protected":false},"author":11556,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-5","post","type-post","status-publish","format-standard","hentry","category-uncategorized","entry"],"_links":{"self":[{"href":"https:\/\/blogs.oregonstate.edu\/scrapy\/wp-json\/wp\/v2\/posts\/5","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.oregonstate.edu\/scrapy\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.oregonstate.edu\/scrapy\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.oregonstate.edu\/scrapy\/wp-json\/wp\/v2\/users\/11556"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.oregonstate.edu\/scrapy\/wp-json\/wp\/v2\/comments?post=5"}],"version-history":[{"count":11,"href":"https:\/\/blogs.oregonstate.edu\/scrapy\/wp-json\/wp\/v2\/posts\/5\/revisions"}],"predecessor-version":[{"id":16,"href":"https:\/\/blogs.oregonstate.edu\/scrapy\/wp-json\/wp\/v2\/posts\/5\/revisions\/16"}],"wp:attachment":[{"href":"https:\/\/blogs.oregonstate.edu\/scrapy\/wp-json\/wp\/v2\/media?parent=5"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.oregonstate.edu\/scrapy\/wp-json\/wp\/v2\/categories?post=5"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.oregonstate.edu\/scrapy\/wp-json\/wp\/v2\/tags?post=5"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}