Scraping Reddit using Scrapy and Python. Web scraping is the process of gathering bulk data from the internet or from web pages. Where a site provides an API, the data can be consumed through it, but many sites offer no API for getting at their data. In that case we can use web scraping, connecting to the web page directly and collecting the required data ourselves.
In this tutorial, we are going to show you how to scrape posts from a Reddit group.
To follow through, you may want to use this URL in the tutorial:
We will open every post and scrape its data, including the group name, author, title, article text, and the numbers of upvotes and comments.
This tutorial will also cover:
· Handling pagination driven by scrolling down in Octoparse
· Dealing with AJAX when opening each Reddit post
· Locating all the posts by modifying the loop mode and XPath in Octoparse
Here are the main steps in this tutorial: [Download task file here]
1) Go To Web Page - to open the targeted web page
· Click '+ Task' to start a task using Advanced Mode
Advanced Mode is a highly flexible and powerful web scraping mode. For people who want to scrape from websites with complex structures, like Airbnb.com, we strongly recommend Advanced Mode to start your data extraction project.
· Paste the URL into the 'Extraction URL' box and click 'Save URL' to move on
2) Set Scroll Down - to load all items from one page
· Turn on the 'Workflow Mode' by switching the 'Workflow' button in the top-right corner in Octoparse
We strongly suggest turning on 'Workflow Mode' to get a clearer picture of what your task is doing, in case any of the steps go wrong.
· Set up Scroll Down
For some websites like Reddit.com, clicking the next page button to paginate is not an option for loading content. To fully load the posts, we need to scroll the page down to the bottom continuously.
· Check the box for 'Scroll down to bottom of the page when finished loading'
· Set up 'Scroll times', 'Interval', and 'Scroll way'
By entering a value X in the 'Scroll times' box, Octoparse will automatically scroll the page down to the bottom X times. In this tutorial, we enter 1 for demonstration purposes. When setting 'Scroll times', you'll often need to test-run the task to check whether you have assigned enough scrolls.
'Interval' is the time interval between every two scrolls. In this case, we are going to set 'Interval' as 3 seconds.
For 'Scroll way', select 'Scroll down to the bottom of the page'
· Click 'OK' to save
Tips! To learn more about how to deal with infinite scrolling in Octoparse, please refer to: Dealing with Infinite Scrolling/Load More
3) Create a 'Loop Item' - to loop click into each item on each list
· Select the first three posts on the current page
· Click 'Loop click each element' to create a 'Loop Item'
Octoparse will automatically select all the posts on the current page. The selected posts will be highlighted in green with other posts highlighted in red.
· Set up AJAX Load for the 'Click Item' action
Reddit applies the AJAX technique to display the post content and comments thread. Therefore, we need to set up AJAX Load for the 'Click Item' step.
· Uncheck the box for 'Retry when page remains unchanged (use discreetly for AJAX loading)' and 'Open the link in new tab'
· Check the box for 'Load the page with AJAX' and set up AJAX Timeout (2-4 seconds will work usually)
· Click 'OK' to save
Tips! For more about dealing with AJAX in Octoparse: Deal with AJAX
4) Extract data - to select the data for extraction
After you click 'Loop click each element', Octoparse will open the first post.
· Click on the data you need on the page
· Select 'Extract text of the selected element' from 'Action Tips'
· Rename the fields by selecting from the pre-defined list or inputting on your own
5) Customize data field by modifying XPath - to improve the accuracy of the item list (Optional)
Once we click 'Loop click each element', Octoparse generates a loop item using the Fixed list loop mode by default. Fixed list is a loop mode for dealing with a fixed number of elements. However, the number of posts on Reddit.com is not fixed; it increases as you scroll down. To enable Octoparse to capture all the posts, including those loaded later, we need to switch the loop mode to Variable list and enter the proper XPath so that every post is located.
· Select 'Loop Item' box
· Select 'Variable list' and enter '//div[contains(@class, 'scrollerItem') and not(contains(@class, 'promote'))]'
· Click 'OK' to save
Tips! 1. 'Fixed list' and 'Variable list' are loop modes in Octoparse. For more about loop modes in Octoparse: 5 Loop Modes in Octoparse. 2. If you want to learn more about XPath and how to generate it, here is a related tutorial you might need: Locate elements with XPath
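If you want to sanity-check that Variable list XPath outside Octoparse, you can run it against a saved copy of the page with lxml (`pip install lxml`). The markup below is a simplified stand-in for Reddit's real post markup, not the actual page source:

```python
from lxml import html

# Simplified stand-in for Reddit's listing markup: the real class
# attributes are much longer, but contain these tokens.
page = html.fromstring("""
<div class="Post scrollerItem">post 1</div>
<div class="Post scrollerItem promotedlink promote">ad</div>
<div class="Post scrollerItem">post 2</div>
""")

# The same XPath entered into Octoparse's Variable list box.
posts = page.xpath(
    "//div[contains(@class, 'scrollerItem') and not(contains(@class, 'promote'))]"
)
print(len(posts))  # matches the two organic posts, skips the promoted one
```

The `not(contains(@class, 'promote'))` clause is what filters out sponsored posts from the loop.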
6) Start extraction - to run the task and get data
· Click “Start Extraction” on the upper left side
· Select “Local Extraction” to run the task on your computer, or select “Cloud Extraction” to run the task in the Cloud (for premium users only)
Here is the sample output.
Was this article helpful? Feel free to let us know if you have any questions or need our assistance.
Contact us here!
The goal is to extract or “scrape” information from the posts on the front page of a subreddit, e.g. http://reddit.com/r/learnpython/new/
You should know that Reddit has an API, and PRAW exists to make using it easier.
- You use it, taking the blue pill—the article ends.
- You take the red pill—you stay in Wonderland, and I show you how deep a JSON response goes.
Remember: all I’m offering is the truth. Nothing more.
Reddit allows you to add a `.json` extension to the end of your request and will give you back a JSON response instead of HTML.
We’ll be using `requests` as our “HTTP client”, which you can install using `pip install requests --user` if you have not already. We’re setting the `User-Agent` header to `Mozilla/5.0` as the default `requests` value is blocked.
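A minimal sketch of that request (the subreddit URL is just an example; fetching it requires network access, and the `fetch_listing` helper name is mine):

```python
import requests

# Reddit blocks the default requests User-Agent, so set our own.
HEADERS = {'User-Agent': 'Mozilla/5.0'}
URL = 'https://www.reddit.com/r/learnpython/new/.json'

def fetch_listing(url=URL):
    """Fetch a subreddit listing as JSON (requires network access)."""
    r = requests.get(url, headers=HEADERS)
    r.raise_for_status()  # bail out on a 4xx/5xx response
    return r.json()
```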
We know that we’re receiving a JSON response from this request, so we call `r.json()` — the `.json()` method on the Response object — which turns a JSON “string” into a Python structure (also see `json.loads()`).
To see a pretty-printed version of the JSON data we can use `json.dumps()` with its `indent` argument.
The output generated for this particular response is quite large, so it makes sense to write it to a file for further inspection.
Note: if you’re using Python 2 you’ll need `from __future__ import print_function` to have access to the `print()` function that has the `file` argument (or you could just use `json.dump()`).
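A sketch of the pretty-printing and file-writing steps, using a tiny stand-in dict in place of the full (large) response:

```python
import json

# Stand-in for r.json(); a small sample keeps the demo offline.
data = {'kind': 'Listing', 'data': {'children': []}}

# Pretty-print with an indent, then write to a file for inspection.
pretty = json.dumps(data, indent=4)
with open('listing.json', 'w') as f:
    f.write(pretty)            # or: json.dump(data, f, indent=4)

print(pretty.splitlines()[0])  # → {
```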
Upon further inspection we can see that `r.json()['data']['children']` is a list of dicts, and each dict represents a submission or “post”. There is also some “subreddit” information available.
These `before` and `after` values are used for result page navigation, just like when you click on the `next` and `prev` buttons. To get to the next page we can pass `after=t3_64o6gh` as a GET param.
When making multiple requests however, you will usually want to use a session object.
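Putting the `after` parameter and a session object together, a pagination loop might look like the following sketch (it assumes network access; the `fetch_pages` helper name is mine, not from the original):

```python
import requests

def fetch_pages(subreddit, pages=3):
    """Walk listing pages by passing the `after` token back as a GET
    param (requires network access; Reddit rate-limits anonymous clients)."""
    url = f'https://www.reddit.com/r/{subreddit}/new/.json'
    posts, after = [], None
    with requests.Session() as s:       # reuse one connection and headers
        s.headers['User-Agent'] = 'Mozilla/5.0'
        for _ in range(pages):
            r = s.get(url, params={'after': after})
            r.raise_for_status()
            data = r.json()['data']
            posts.extend(data['children'])
            after = data['after']       # e.g. 't3_64o6gh'
            if after is None:           # no more pages to fetch
                break
    return posts
```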
So, as mentioned, each submission is a dict and the important information is available inside the `data` key. I’ve truncated the output here, but important values include `author`, `selftext`, `title` and `url`.
It’s pretty annoying having to use `['data']` all the time, so we could instead have declared `posts` using a list comprehension.
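For example, with a small truncated stand-in for the real response:

```python
# Offline stand-in for r.json(): two truncated submissions.
response = {'data': {'children': [
    {'kind': 't3', 'data': {'author': 'alice', 'title': 'First post',
                            'url': 'https://example.com/1'}},
    {'kind': 't3', 'data': {'author': 'bob', 'title': 'Second post',
                            'url': 'https://example.com/2'}},
]}}

# Strip off the ['data'] layer once, up front.
posts = [child['data'] for child in response['data']['children']]

print(posts[0]['title'])  # → First post
```

After this, each item in `posts` can be indexed directly by `author`, `title`, `url` and so on.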
One example of why you might want to do this is to “scrape” the links from one of the “image posting” subreddits to access the images.
r/aww
One such subreddit is r/aww, home of “teh cuddlez”. Some of these URLs would require further processing, though, as not all of them are direct links to images, and not all of them are images at all. In the case of the direct image links we could fetch them and save the result to disk.
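A minimal sketch of fetching a direct image link and saving it to disk (the `save_image` helper is illustrative, and it assumes the URL ends in a usable filename; fetching requires network access):

```python
import os
import requests

def save_image(url, directory='images'):
    """Download a direct image link and write the bytes to disk."""
    os.makedirs(directory, exist_ok=True)
    # Take the last path segment as the filename, e.g. 'cat.jpg'.
    filename = os.path.join(directory, url.rsplit('/', 1)[-1])
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    r.raise_for_status()
    with open(filename, 'wb') as f:   # binary mode for image bytes
        f.write(r.content)
    return filename
```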
BeautifulSoup
You could of course just request the regular URL and process the HTML with `BeautifulSoup` and `html5lib`, which you can install using `pip install beautifulsoup4 html5lib --user` if you do not already have them.
BeautifulSoup’s `select()` method locates items using CSS selectors, and `div.thing` here matches `<div>` tags that contain `thing` as a class name, e.g. `class='thing'`.
We can then use dict indexing on a `BeautifulSoup` Tag object to extract the value of a specific tag attribute. In this case the URL is contained in the `data-url='...'` attribute of the `<div>` tag.
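Putting those pieces together on a trimmed-down stand-in for the listing markup (the real `thing` divs carry many more attributes); `html.parser` is used here so the snippet has no extra dependency, though the article installs `html5lib`:

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for old.reddit.com's listing markup.
html = """
<div class="thing" data-url="https://example.com/cat.jpg">...</div>
<div class="thing" data-url="https://example.com/dog.jpg">...</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# select() takes a CSS selector; div.thing matches <div> tags
# with "thing" in their class attribute.
urls = [div['data-url'] for div in soup.select('div.thing')]
print(urls[0])  # → https://example.com/cat.jpg
```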
As already mentioned, Reddit does have an API with rules and guidelines, and if you want to do any kind of “large-scale” interaction with Reddit you should probably use it via the PRAW library.