Use Ruby's Nokogiri library to capture national corporate credit information
The following is a crawler program written with Ruby's Nokogiri library. It scrapes content from the national corporate credit information website (gsxt.gov.cn). The program routes its requests through a proxy (crawler IP) server at duoip:8000.
require 'nokogiri'
require 'open-uri'
# Define the proxy (crawler IP) server
proxy_host = 'duoip'
proxy_port = 8000
# Define the URL to crawl
url = 'http://www.gsxt.gov.cn/index.html'
# Use open-uri to open the URL through the proxy server and fetch the page content
doc = Nokogiri::HTML(URI.open(url, proxy: "http://#{proxy_host}:#{proxy_port}"))
# Find all the companies on the page
companies = doc.css('div.item')
# Iterate over each company
companies.each do |company|
  # Get the name of the company
  name = company.css('.name').text
  # Get the address of the company
  address = company.css('.address').text
  # Output the company name and address
  puts "#{name}, #{address}"
end
Here's an explanation of each line of code:
- Lines 1-2: The nokogiri and open-uri libraries are imported. Nokogiri is a powerful Ruby library for parsing HTML and XML documents; open-uri is a standard Ruby library for opening URLs.
- Lines 4-5: Define the host and port of the proxy (crawler IP) server. This is an HTTP proxy used to hide your real IP address and avoid being blocked by the website.
- Line 7: Defines the URL to be crawled. In this example it is the homepage of the national corporate credit information website.
- Line 9: Uses open-uri to open the URL through the proxy server and Nokogiri to parse the returned page content.
- Line 11: Uses a CSS selector to find all the company entries on the page. Each entry is assumed to be contained in a div.item HTML element (a small standalone example follows after this list).
- Line 13: Iterates over each company entry.
- Line 15: Gets the name of the company.
- Line 17: Gets the address of the company.
- Line 19: Outputs the company name and address.
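Before pointing these selectors at the live site, it can help to see how they behave on a small inline HTML fragment. The fragment and the div.item / .name / .address class names below are only assumptions about the page structure, used for illustration:

require 'nokogiri'
# A minimal, self-contained sketch of how the CSS selectors above behave.
html = <<~HTML
  <div class="item">
    <span class="name">Example Company Ltd.</span>
    <span class="address">1 Example Road</span>
  </div>
  <div class="item">
    <span class="name">Sample Trading Co.</span>
    <span class="address">2 Sample Street</span>
  </div>
HTML
doc = Nokogiri::HTML(html)
# div.item selects each company block; .name and .address select its children
doc.css('div.item').each do |company|
  puts "#{company.css('.name').text}, #{company.css('.address').text}"
end
# Expected output:
# Example Company Ltd., 1 Example Road
# Sample Trading Co., 2 Sample Street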
Note: This program is just a basic example. A real crawler may need more complex functionality, such as handling JavaScript-rendered content or paginated results (a rough pagination sketch follows below). When writing a crawler, always abide by the website's terms of use and do not place an excessive load on the site.
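If the results span multiple pages, one common approach is to loop over page URLs. The sketch below is only a rough illustration: the "page" query parameter is a hypothetical assumption and is unlikely to match the real site, which may drive pagination through JavaScript or POST requests instead (handling those would require a headless browser, for example via the selenium-webdriver or ferrum gems).

require 'nokogiri'
require 'open-uri'
proxy = 'http://duoip:8000'  # proxy (crawler IP) server from the example above
base_url = 'http://www.gsxt.gov.cn/index.html'
# Hypothetical pagination loop; the "page" parameter is assumed for illustration
(1..3).each do |page|
  page_url = "#{base_url}?page=#{page}"
  begin
    doc = Nokogiri::HTML(URI.open(page_url, proxy: proxy))
  rescue OpenURI::HTTPError => e
    # Stop if a page does not exist or the server rejects the request
    warn "Stopping at page #{page}: #{e.message}"
    break
  end
  doc.css('div.item').each do |company|
    puts "#{company.css('.name').text}, #{company.css('.address').text}"
  end
  sleep 1  # throttle requests to avoid putting excessive load on the site
end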