Reddit Web Scraper



Reddit Web Scraper - now extract WallStreetBets data with ease. Our prebuilt Reddit web scraper lets you extract business data, reviews, and various other forms of data quickly and easily from numerous listings, without having to write any code. Why should you consider scraping Reddit for the WallStreetBets subreddit?

Wednesday, December 04, 2019

The latest version of this tutorial is available here. Go check it out now!

In this tutorial, we are going to show you how to scrape posts from a Reddit group.

To follow along, you may want to use this URL in the tutorial:

We will open every post and scrape its data, including the group name, author, title, article text, and the numbers of upvotes and comments.

This tutorial will also cover:

· Handling pagination powered by scrolling down in Octoparse

· Dealing with AJAX to open each Reddit post

· Locating all the posts by modifying the loop mode and XPath in Octoparse

Here are the main steps in this tutorial: [Download task file here]

1) Go To Web Page - to open the targeted web page

· Click '+ Task' to start a task using Advanced Mode

Advanced Mode is a highly flexible and powerful web scraping mode. For people who want to scrape from websites with complex structures, like Airbnb.com, we strongly recommend Advanced Mode to start your data extraction project.

· Paste the URL into the 'Extraction URL' box and click 'Save URL' to move on

2) Set Scroll Down - to load all items from one page

· Turn on 'Workflow Mode' by toggling the 'Workflow' button in the top-right corner of Octoparse

We strongly suggest you turn on 'Workflow Mode' to get a better picture of what you are doing with your task, in case you mess up the steps.

· Set up Scroll Down

For some websites like Reddit.com, clicking the next page button to paginate is not an option for loading content. To fully load the posts, we need to scroll the page down to the bottom continuously.

· Check the box for 'Scroll down to bottom of the page when finished loading'

· Set up 'Scroll times', 'Interval', and 'Scroll way'

By entering a value X into the 'Scroll times' box, Octoparse will automatically scroll the page down to the bottom X times. In this tutorial, 1 is entered for demonstration purposes. When setting up 'Scroll times', you'll often need to test run the task to check whether you have assigned enough scrolls.

'Interval' is the time interval between every two scrolls. In this case, we are going to set 'Interval' as 3 seconds.

For 'Scroll way', select 'Scroll down to the bottom of the page'

· Click 'OK' to save

Tips!

To learn more about how to deal with infinite scrolling in Octoparse, please refer to:

· Dealing with Infinite Scrolling/Load More

3) Create a 'Loop Item' - to loop click into each item on each list

· Select the first three posts on the current page

· Click 'Loop click each element' to create a 'Loop Item'

Octoparse will automatically select all the posts on the current page. The selected posts will be highlighted in green with other posts highlighted in red.

· Set up AJAX Load for the 'Click Item' action

Reddit applies the AJAX technique to display the post content and comments thread. Therefore, we need to set up AJAX Load for the 'Click Item' step.

· Uncheck the box for 'Retry when page remains unchanged (use discreetly for AJAX loading)' and 'Open the link in new tab'

· Check the box for 'Load the page with AJAX' and set up the AJAX Timeout (2-4 seconds will usually work)

· Click 'OK' to save

Tips!

For more about dealing with AJAX in Octoparse:

· Deal with AJAX

4) Extract data - to select the data for extraction

After you click 'Loop click each element', Octoparse will open the first post.


· Click on the data you need on the page

· Select 'Extract text of the selected element' from 'Action Tips'

· Rename the fields by selecting from the pre-defined list or entering your own names

5) Customize data field by modifying XPath - to improve the accuracy of the item list (Optional)

Once we click 'Loop click each element', Octoparse generates a loop item using the Fixed list loop mode by default. Fixed list is a loop mode for dealing with a fixed number of elements. However, the number of posts on Reddit.com is not fixed but increases as you scroll down. To enable Octoparse to capture all the posts, including those loaded later, we need to switch the loop mode to Variable list and enter the proper XPath so that all the posts are located.

· Select 'Loop Item' box

· Select 'Variable list' and enter '//div[contains(@class, "scrollerItem") and not(contains(@class, "promote"))]'

· Click 'OK' to save

Tips!

1. 'Fixed list' and 'Variable list' are loop modes in Octoparse. For more about loop modes in Octoparse:

· 5 Loop Modes in Octoparse

2. If you want to learn more about XPath and how to generate it, here is a related tutorial you might need:

· Locate elements with XPath

6) Start extraction - to run the task and get data

· Click 'Start Extraction' on the upper left side

· Select 'Local Extraction' to run the task on your computer, or select 'Cloud Extraction' to run the task in the Cloud (for premium users only)

Here is the sample output.

Was this article helpful? Feel free to let us know if you have any questions or need our assistance.

Contact us here!

Introduction

Imagine you want to gather a large amount of data from several websites as quickly as possible. Would you do it manually, or would you look for a practical way to get it all? You may be asking yourself why you would want to do that in the first place. Follow along as we go over some examples to understand the need for web scraping:

  • Wego is a website where you can book flights and hotels; it gives you the lowest price after comparing 1,000 booking sites. Web scraping is what makes that comparison possible.
  • Plagiarismdetector is a tool you can use to check for plagiarism in your article; it also uses web scraping to compare your words against thousands of other websites.
  • Many companies also use web scraping to make strategic marketing decisions, scraping social network profiles to determine which posts get the most interactions.

Prerequisites

Before we dive right in, you will need the following:

  1. A good understanding of the Python programming language.
  2. A basic understanding of HTML.

Now that we have a brief overview of web scraping, let's talk about the most important thing: the legal issues surrounding the topic.

How to know if the website allows web scraping?

  • Add '/robots.txt' to the end of the URL, such as www.facebook.com/robots.txt, so that you can see the website's scraping rules and what is forbidden to scrape.

For example:
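A rule of the kind being described here (the user-agent shown is generic) looks like this:

    User-agent: *
    Crawl-delay: 5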

The rule above tells us that the site asks for a delay of 5 seconds between requests.

Another example:
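An illustrative rule of that kind (the exact paths in Facebook's robots.txt may differ) is:

    User-agent: Discordbot
    Allow: /video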

On www.facebook.com/robots.txt you can find a rule like the one listed above; it means that the Discord bot has permission to scrape Facebook videos.

  • You can run the following Python code that makes a GET request to the website server:
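A minimal sketch of such a check with the requests library (the URL here is a placeholder) would be:

    import requests

    # Request the page you want to scrape and inspect the HTTP status code
    response = requests.get('https://www.example.com/')
    print(response.status_code)  # 200 means the page was served successfully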

If the result is a 200 then you have the permission to perform web scraping on the website, but you also have to take a look at the scraping rules.

As an example, if you run the following code:
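A sketch of the same check against the site used later in this tutorial (the User-Agent string is a placeholder) might be:

    import requests

    headers = {'User-Agent': 'Mozilla/5.0'}  # replace with your own User-Agent
    response = requests.get('https://old.reddit.com/', headers=headers)
    print(response.status_code)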


If the result is a 200, then you have permission to start crawling, but you must also be aware of the following points:

  • You can only scrape data that is available to the public, like the price of a product; you cannot scrape anything private, like a sign-in page.
  • You can't use the scraped data for any commercial purposes.
  • Some websites provide an API to use for web scraping, like Amazon; you can find their API here.

As we know, Python has different libraries for different purposes.

In this tutorial, we are going to use the Beautiful Soup 4, urllib, requests, and plyer libraries.

For Windows users, you can install them using the following command in your terminal:
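For example, assuming pip is on your PATH (urllib ships with Python and needs no install):

    pip install beautifulsoup4 requests plyer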

For Linux users you can use:
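Likewise, assuming pip3 is installed:

    pip3 install beautifulsoup4 requests plyer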

You're ready to go. Let's get started and learn a bit more about web scraping through two real-life projects.

Reddit Web Scraper

A year ago, I wanted to build a smart AI bot. I aimed to make it talk like a human, but I had a problem: I didn't have a good dataset to train my bot on, so I decided to use posts and comments from Reddit.

Here we will go through how to build the basics of the aforementioned app step by step, and we will use https://old.reddit.com/.

First of all, we import the libraries we want to use in our code.

The requests library allows us to make GET, PUT, and other requests to the website's server, and the Beautiful Soup library is used for parsing a page and then pulling specific items out of it. We'll see it in a practical example soon.

Second, the URL we are going to use is for the TOP posts on Reddit.

Third, the headers part with 'User-Agent' is a browser-related trick so the server does not recognize you as a bot and restrict your number of requests. To find out your 'User-Agent', you can do a web search for 'what is my User-Agent?' in your browser.

Finally, we make a GET request to connect to that URL and then pull out the HTML code for that page using the Beautiful Soup library.
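Putting those four steps together, a minimal sketch looks like this (the variable names and the User-Agent string are placeholders):

    import requests
    from bs4 import BeautifulSoup

    # URL of the TOP posts on old Reddit
    url = 'https://old.reddit.com/top/'

    # Identify ourselves as a regular browser; replace the value with your own User-Agent
    headers = {'User-Agent': 'Mozilla/5.0'}

    # Connect to the URL and parse the returned HTML
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')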

Now let's move on to the next step of building our app:

Open this URL, then press F12 to inspect the page; you will see its HTML code. To find where the HTML code for the element you want to locate is, right-click on that element and then click 'Inspect'.

After doing the process above on the first title on the page, you can see the following code, with a highlight on the tag that holds the data you right-clicked on:
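The highlighted tag is roughly of this shape (attributes trimmed and values illustrative):

    <a class="title may-blank" href="https://example.com/some-link">Post title goes here</a>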

Now let's pull out every title on that page. You can see that there is a 'div' that contains a table called siteTable, and the titles are within it.

First, we have to search for that table, then get every 'a' element in it that has a class 'title'.

Now, from each element, we will extract the text, which is the title, and put every title in a dictionary before printing it.
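A sketch of those two steps (the dictionary layout and the variable names are assumptions):

    import requests
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder User-Agent
    html = requests.get('https://old.reddit.com/top/', headers=headers).text
    soup = BeautifulSoup(html, 'html.parser')

    # Search for the siteTable container first
    site_table = soup.find('div', id='siteTable')

    # Then take every <a> element inside it that has the class 'title'
    titles = {}
    for index, link in enumerate(site_table.find_all('a', class_='title')):
        titles[index] = link.text  # the link text is the post title

    print(titles)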

After running our code you will see the following result, which is every title on that page:

Finally, you can do the same process for the comments and replies to build up a good dataset as mentioned before.

When it comes to web scraping, an API is the best solution that comes to the mind of most data scientists. An API (Application Programming Interface) is an intermediary that allows one piece of software to talk to another. In simple terms, you can ask an API for specific data by passing JSON to it, and in return it will give you data in JSON format as well.

For example, Reddit has a publicly documented API that can be used, which you can find here.
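As a quick illustration of the JSON-in, JSON-out idea, Reddit will also serve most listings as JSON if you append .json to the URL; the response layout assumed in this sketch is the standard listing format:

    import requests

    headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder User-Agent
    data = requests.get('https://www.reddit.com/top.json', headers=headers).json()

    # Each post in the listing carries its title under data -> children -> data -> title
    for post in data['data']['children']:
        print(post['data']['title'])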

Also, it is worth mentioning that certain websites provide XHTML or RSS feeds that can be parsed as XML (Extensible Markup Language). XML does not define the form of the page; it defines the content, free of any formatting constraints, so it is much easier to scrape a website that serves XML.

For example, Reddit provides RSS feeds that can be parsed as XML, which you can find here.

Let's build another app to better understand how web scraping works.

COVID-19 Desktop Notifier

Now we are going to learn how to build a COVID-19 notification system, so we will be able to know the number of new cases and deaths within our country.

The data is taken from the Worldometer website, where you can find real-time COVID-19 updates for any country in the world.

Let's get started by importing the libraries we are going to use:
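Based on the libraries described below, the imports would look something like this (Beautiful Soup is included since we parse the page later):

    from urllib.request import urlopen   # to fetch the page
    from bs4 import BeautifulSoup        # to parse the HTML
    from plyer import notification       # to show desktop notifications
    import time                          # to wait between notifications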

Here we are using urllib to make requests, but feel free to use the requests library that we used in the Reddit Web Scraper example above.

We are using the plyer package to show the notifications, and the time module to make the next notification pop up after a delay we set.
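Continuing from the imports above, the request itself is a sketch like this (the Worldometer country path is assumed to follow this pattern):

    # Change 'us' in the URL to your own country's page on Worldometer
    url = 'https://www.worldometers.info/coronavirus/country/us/'
    page = urlopen(url)  # fetches the page, much like opening the URL in your browser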

In the code above, you can change 'us' in the URL to the name of your country, and urlopen does the same thing as opening the URL in your browser.

Now if we open this URL and scroll down to the UPDATES section, then right-click on the 'new cases' text and click 'Inspect', we will see the following HTML code for it:
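The markup is roughly of this shape (the numbers and wording are illustrative):

    <li class="news_li">
      <strong>30,000 new cases</strong> and <strong>500 new deaths</strong> in the United States
    </li>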

We can see that the new cases and deaths are within an 'li' tag with the 'news_li' class. Let's write a code snippet to extract that data from it.

After pulling the HTML code out of the page and searching for the tag and class we talked about, we take the 'strong' elements: the first one holds the number of new cases, and the one reached through its next siblings holds the number of new deaths.
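A sketch of that extraction, continuing from the page object above (the markup may have changed since the article was written, so treat the selectors as assumptions):

    soup = BeautifulSoup(page, 'html.parser')

    # The update sentence sits in an <li> with the 'news_li' class
    updates = soup.find('li', {'class': 'news_li'})

    # The first <strong> holds the "X new cases" text
    new_cases = updates.strong.text.split()[0]

    # The next <strong> sibling holds the "Y new deaths" text
    new_deaths = updates.strong.find_next_sibling('strong').text.split()[0]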

In the last part of our code, we make an infinite while loop that uses the data we pulled out before to show it in a notification pop-up. The delay before the next notification pops up is set to 20 seconds, which you can change to whatever you want.
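Continuing from the values extracted above, a sketch of that loop (the notification title is a placeholder):

    while True:
        # Show the extracted numbers in a desktop pop-up
        notification.notify(
            title='COVID-19 Update (US)',
            message=f'New cases: {new_cases}\nNew deaths: {new_deaths}',
            timeout=10,  # seconds the pop-up stays visible
        )
        time.sleep(20)   # wait 20 seconds before the next notification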

After running our code you will see the following notification in the right-hand corner of your desktop.

Conclusion

We've just proven that anything on the web can be scraped and stored, and there are a lot of reasons why we might want to use that information. As an example:

Imagine you are working at a social media platform and you are tasked with deleting any posts that may go against the community guidelines. The best way to do that is to develop a web scraper application that scrapes and stores the number of likes and comments for every post. If a post receives a lot of comments but no likes, we can deduce that this particular post may be striking a chord in people, and we should take a look at it.

There are a lot of possibilities, and it's up to you (as a developer) to choose how you will use that information.

About the author


Ahmad Mardeni


Ahmad is a passionate software developer, an avid researcher, and a businessman. He began his journey to become a cybersecurity expert two years ago, and he has participated in many hackathons and programming competitions. As he says, 'Knowledge is power', so he wants to deliver good content by being a technical writer.




