Scraping Web Data into TigerGraph with Tor

  • Emily McAuliffe
  • March 31, 2021
  • blog, Community, Developers

By Linxiu Jiang. Originally posted on Medium.

1. Introduction

This blog introduces a scraping demo I recently built and then drills into some interesting ideas about encryption and decryption.

For the scraping demo, I use the Requests library and a Tor proxy to scrape user subscription information from the video platform www.bilibili.com, and then feed the data into a TigerGraph database to analyze the relationships between users. For encryption and decryption, I introduce the fundamental mechanism of the Tor network, which is a very intuitive way to understand the concepts.

If these topics interest you, keep reading!

2. Scraping Demo

2.1. Anti-anti-spider

If you try to scrape data from web pages, many anti-spider mechanisms can block your access. Before we deal with the bans, we should know the most common ones:

· Async IO
· User-Agent
· Animating Authentication
· Access per unit time
· Access volume

Async IO: users need to do something, such as scrolling the mouse, to get a new response on the same webpage.

User-Agent: the server checks the User-Agent field in the client request header.

Animating Authentication: some servers test whether a user is a headless robot by making it play a simple game.

Access per unit time and access volume can be easily controlled, so I won’t explain them further.
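Controlling access per unit time usually comes down to spacing requests out. Here is a minimal sketch, assuming a hypothetical Throttle helper that is not part of the original demo. It enforces a minimum interval between calls, and the clock and sleep functions are injectable only so the idea can be tried without real delays.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the class can be exercised without real delays
        self._sleep = sleep
        self._last = None     # time of the previous request, None before the first one

    def wait(self):
        """Sleep just long enough to keep requests min_interval seconds apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling wait() on one shared Throttle instance before each request keeps the crawler under a chosen request rate.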

2.2. What factors should be considered when selecting a crawler?

How do we stay anonymous, and what factors should be considered when selecting a crawler? The key is simple: choose the one that matches your requirements. For my requirements, I compared two common scrapers:

· Selenium scraper
· Requests library + Tor

Selenium scraper: Selenium is a test automation tool. It supports JavaScript, which means it can scrape data by simulating browser actions.

Requests library + Tor: Requests is a traditional HTTP client that scrapes data by sending HTTP requests directly. Tor is an overlay network of relays: it routes traffic through a chain of relay computers that act as proxies, which helps avoid bans from the server.

This scraping demo is based on www.bilibili.com, whose data is easy to access, so I chose Requests + Tor as my scraper.

2.3. What kind of configuration is needed for Tor? How to scrape a web page with Tor?

2.3.1. Configurations:

To scrape data from a website using Tor, we need to configure two important things:

· AJAX Request
· User-Agent

AJAX Request: in this case, we handle it through the “pn” parameter in the URL.

User-Agent: change the User-Agent header of each request, and send the request through the Tor proxy.

2.3.2. Tor methods:

Use requests library to request a page

import requests

requests.get(url).text

Use a proxy to request a page

proxies = {
    'http' : 'socks5://127.0.0.1:9050',
    'https' : 'socks5://127.0.0.1:9050',
}
requests.get(url, proxies=proxies).text

Use stem to renew IP address

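from stem import Signal
from stem.control import Controller

def renew_ip():
    # connect to the Tor control port and ask for a new circuit
    with Controller.from_port(port=9051) as controller:
        controller.authenticate()
        print("Success!")
        controller.signal(Signal.NEWNYM)
        print("New Tor connection processed")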

Use fake_useragent to change the user agent

from fake_useragent import UserAgent

headers = {'User-Agent': UserAgent().random}
requests.get(url, proxies=proxies, headers=headers).text

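Use Cron for automation, with a random wait at the start of each run

import random
import time

wait = random.uniform(0, 2*60*60)  # random delay of up to two hours
time.sleep(wait)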
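The pieces above can be combined into one fetch routine. Here is a minimal sketch, assuming a hypothetical fetch_with_rotation helper. Its get and renew parameters are stand-ins that the caller supplies: get would wrap requests.get with the Tor proxies and a random User-Agent, and renew would be the renew_ip function above. They are passed in so that the sketch itself needs no third-party imports.

```python
def fetch_with_rotation(url, get, renew, max_attempts=3):
    """Try to fetch url; after each failed attempt, ask for a new Tor identity.

    get(url) -> (status_code, text) and renew() are hypothetical callables
    supplied by the caller.
    """
    for attempt in range(1, max_attempts + 1):
        status, text = get(url)
        if status == 200:
            return text
        if attempt < max_attempts:
            renew()  # e.g. renew_ip(), which sends Signal.NEWNYM
    raise RuntimeError("giving up on {} after {} attempts".format(url, max_attempts))
```

Tor may rate-limit new-identity requests, so a short pause after renew() may be needed in practice.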

2.4. Implementation with python

First step: send a request to one web page to get the total. The webpage has several limitations. One is a volume limitation: we can only access the first 5 pages. The other is asynchronous IO.

We can check the access volume limitation by this:

if req.json()["code"] == 22007:  # limitation error for page access

The asynchronous IO can be handled like this:

tbc…

# Request to get the total
url = 'https://api.bilibili.com/x/relation/followers?vmid={}&pn=1&ps=500&order=desc'.format(channel_id)
req = requests.get(url, headers=header)
total = req.json()["data"]["total"]
print(len(req.json()["data"]["list"]))

There is one more limitation: the request must come from a logged-in user. To handle this, we can check status_code before asking for the JSON object from the response. If status_code is 200, we have successfully logged in.

if req.status_code == 200:  # user has successfully logged in

Second step: access the first 5 pages and parse each response to get the information we want. The data arrives as a JSON object, so to extract the specific fields (in this case user id, name, avatar, and follow time) we work with that JSON object.

fans = req.json()["data"]["list"]   # get json object from request
for fan in fans:
    do something...(extract, output and so on)
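Putting the first two steps together, the crawl can loop over the pages and stop at the page limit. Here is a minimal sketch, assuming that “pn” is the page number and “ps” the page size, and assuming a hypothetical fetch_page callable that returns the decoded JSON for one URL (for example by calling requests.get(...).json()). The URL and the error code 22007 come from the snippets above; the names followers_url and collect_fans are only for illustration.

```python
FOLLOWERS_URL = ('https://api.bilibili.com/x/relation/followers'
                 '?vmid={}&pn={}&ps={}&order=desc')
PAGE_LIMIT_CODE = 22007  # limitation error for page access (from the check above)


def followers_url(channel_id, pn, ps=500):
    """Build the followers URL for page pn (assumed: pn = page number, ps = page size)."""
    return FOLLOWERS_URL.format(channel_id, pn, ps)


def collect_fans(channel_id, fetch_page, max_pages=5):
    """Collect the fan records from pages 1..max_pages.

    fetch_page(url) is a hypothetical callable returning the decoded JSON dict.
    """
    fans = []
    for pn in range(1, max_pages + 1):
        payload = fetch_page(followers_url(channel_id, pn))
        if payload.get("code") == PAGE_LIMIT_CODE:
            break  # the server refuses pages beyond its limit
        page = payload["data"]["list"]
        if not page:
            break  # no more followers to read
        fans.extend(page)
    return fans
```

Passing the fetch function in keeps the paging logic separate from the proxy and header handling.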

Final step: store the data into files. TigerGraph supports both .csv and .json files. In this case, we use .csv to store the data.

# header row
fansfile.write('"id","name","avatar"\n')  # write the header row
followsfile.write('"from","to","sub_date"\n')  # write the header row

# vertex csv file
fansfile.write('"{}", "{}", "{}" \n'.format(fan["mid"], fan["uname"], fan["face"]))

# edge csv file
followsfile.write('"{}", "{}", "{}" \n'.format(fan["mid"], channel_id, datetime.fromtimestamp(fan["mtime"])))
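Hand-built CSV lines can break when a user name contains a quote or a comma. Here is a minimal sketch of the same two files written with Python's standard csv module instead. The helper names write_fans and write_follows are hypothetical; the fields mid, uname, face, and mtime are the ones used above.

```python
import csv
from datetime import datetime


def write_fans(fansfile, fans):
    """Write the vertex file: a header row, then one row per fan."""
    writer = csv.writer(fansfile, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["id", "name", "avatar"])
    for fan in fans:
        writer.writerow([fan["mid"], fan["uname"], fan["face"]])


def write_follows(followsfile, fans, channel_id):
    """Write the edge file: a header row, then one (fan -> channel) row per fan."""
    writer = csv.writer(followsfile, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["from", "to", "sub_date"])
    for fan in fans:
        sub_date = datetime.fromtimestamp(fan["mtime"])
        writer.writerow([fan["mid"], channel_id, sub_date])
```

When writing to real files, the csv documentation recommends opening them with newline="" so that line endings are not translated twice.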

Here are the output files:

Figure: fans.csv file

Figure: follows.csv file

2.5. Feed data into TigerGraph

If you are not familiar with TigerGraph, see Akash’s blog at https://www.tigergraph.com/blog/getting-started-with-tigergraph-3-0/ to get started.

2.5.1. Build schema and create the graph

Person vertex holds attributes: id, name, avatar
Subscribe_to edge holds attribute: sub_date
Schema: person -(subscribe_to)-> person

2.5.2. Load data from CSV files

 

2.5.3. Explore the Graph

Now you can explore the graph using queries or GraphStudio!

3. Encryption Exploration

Let us drill into my favorite part of this blog — encryption and decryption.

In the first part, we mentioned the Tor network. Tor provides a smart way to encrypt users’ activity against spying. However, the process still has vulnerable components, which means that even if you build the onion network diligently, your activity can still be spied on somewhere in it. Looking into the Tor mechanism helps us find the reason.

Normally, when you connect to the internet, the server or a spy can learn who you are and what you do from the IP address attached to your requests.

 

Figure: Normal access

In the Tor network, the client’s data, including its IP address, is encrypted in layers and passes through multiple relay points (proxies). The client holds the keys for all of the layers, while each relay can remove only its own layer. When the client sends a request to the server, the layers are peeled off one relay at a time until the request arrives at the server. When the server responds, the layers are added back onto the response data one relay at a time until it arrives at the client, which decrypts the data with its keys.
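The layering can be illustrated with a toy model. Here is a minimal sketch, assuming three relays and a deliberately simple XOR keystream cipher built from hashlib. It is for illustration only: it is not secure and it is not Tor's actual cryptography. The function names and relay keys are hypothetical.

```python
import hashlib


def keystream(key, length):
    """Derive `length` pseudo-random bytes from key (toy construction, not secure)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def xor_layer(data, key):
    """Add or remove one layer: XOR with the key's keystream (its own inverse)."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


def wrap(message, relay_keys):
    """Client side: apply one layer per relay, exit relay first, entry relay last.

    With this toy XOR cipher the order does not change the result;
    a real cipher would require the relays to peel in the matching order.
    """
    for key in reversed(relay_keys):
        message = xor_layer(message, key)
    return message


relay_keys = [b"entry-key", b"middle-key", b"exit-key"]  # hypothetical keys
onion = wrap(b"GET /followers", relay_keys)

# Each relay peels exactly one layer with the single key it knows.
hop = onion
for key in relay_keys:
    hop = xor_layer(hop, key)

print(hop)  # the request is readable again only after the last relay
```

No single relay in this model sees both the original request and all three keys, which is the property the layering is meant to provide.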

 

Figure: Tor access

From this picture, we can see that data is protected from spying while it travels along the relay path. The data stays safe because each relay point is isolated: the relays do not know each other at all, and the only information each one holds is its encrypted input and output. However, components “1” and “2” are vulnerable. If a hacker places spies at these two positions, the client’s data can be accessed easily. The hacker can also record and compare the timing signals between these two positions. (This is another story.)

If you have any questions or suggestions, please check out the whole project on my GitHub or contact me on LinkedIn: https://www.linkedin.com/in/linxiu-frances-jiang-961986117?lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_contact_details%3BuFR2ggkDTOKCiS6focX2PQ%3D%3D
