Scraping | Crawling

Scraping

Scraping involves extracting a large amount of data from a specified webpage. The process retrieves the full HTML page of the website and extracts the information we are interested in.

In this demo, I use the cheerio library to extract OG (Open Graph) tags from a sample webpage.

const axios = require("axios")

async function createBoardAPI(mydata) {
  // Take the first whitespace-separated token that contains a URL
  const myurl = mydata.contents
    .split(" ")
    .filter((ele) => ele.includes("http"))[0]

  // Fetch the full HTML of that page
  const result = await axios.get(myurl)
}

First, we extract the URL from the argument string and store it in a variable named 'myurl'. After separating the URL from the rest of the string, we request the HTML with an axios GET request and store the response in the result variable.

// Parse the HTML and walk over every meta tag
const $ = cheerio.load(result.data)

$("meta").each((_, ele) => {
    // Only meta tags with a property attribute (e.g. og:title) matter here
    if ($(ele).attr("property")) {
        const key = $(ele).attr("property").split(":")[1]
        const value = $(ele).attr("content")
        console.log(`${key}: ${value}`)
    }
})

Cheerio parses the HTML, letting us loop over the parts of the file we are interested in. Since we want the OG meta tags of the webpage, we select all elements with the "meta" tag. For each element, we split its property attribute and read its content attribute, storing them as key and value respectively. This step requires inspecting the HTML manually to identify the appropriate tags, ids, or class names.

The data can now be stored as an object or in a data structure like a map.
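For example, a minimal sketch (reusing the $ instance loaded above) that collects the OG tags into a plain object:

// Collect the OG tags into a plain object, e.g. { title: "...", image: "..." }
const ogTags = {}
$("meta").each((_, ele) => {
    const property = $(ele).attr("property")
    if (property && property.startsWith("og:")) {
        ogTags[property.split(":")[1]] = $(ele).attr("content")
    }
})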

Crawling

While crawling and scraping are similar in that both extract large amounts of data from external webpages, crawling scrapes data at multiple intervals and in a more automated way, including data nested in iframes or spread across multiple pages rather than a single one. In this example, I use the Puppeteer tool to crawl stock data.

const puppeteer = require("puppeteer")

async function startCrawling() {
    // headless: false opens a visible Chromium window
    const browser = await puppeteer.launch({ headless: false })
    const page = await browser.newPage()
    await page.setViewport({ width: 1280, height: 720 })
    await page.goto('https://finance.naver.com/item/sise.naver?code=005930')
    // Wait one second so repeated requests are not mistaken for an attack
    await page.waitForTimeout(1000)

Puppeteer launches a Chromium browser and opens the webpage of interest. The launch options specify whether the process runs visibly or headlessly, and the viewport option sets how large the window will be when visible. Since frequent, regular traffic to a webpage can be mistaken for an attack on it, we wait for one second.

const frame = page
    .frames()
    .find((ele) => ele.url().includes("/item/sise_day.naver?code=005930"))
    

Data located within an iframe needs to be extracted through a different Puppeteer API: we find the frame whose URL matches the daily price page, then query inside that frame.

// i is the row index supplied by an enclosing loop over the table rows
const date = await frame.$eval(
    `body > table.type2 > tbody > tr:nth-child(${i}) > td:nth-child(1) > span`,
    (ele) => ele.textContent
)

const price = await frame.$eval(
    `body > table.type2 > tbody > tr:nth-child(${i}) > td:nth-child(2) > span`,
    (ele) => ele.textContent
)

console.log(date, price)

Using frame.$eval, we pass the CSS selector path of the data and extract its textContent. (Multiple matching elements can be extracted in one call with the frame.$$eval API.)
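As an illustration, a small sketch using frame.$$eval to read every date cell in one call; the simplified selector here is an assumption and may need adjusting against the actual table markup:

// $$eval runs the callback once over the array of all matching elements
const dates = await frame.$$eval(
    "body > table.type2 > tbody tr > td:nth-child(1) > span",
    (els) => els.map((ele) => ele.textContent.trim())
)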

Similarly, the dates and prices of the stock can now be stored in data structures.
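For example, a rough sketch of the enclosing loop that accumulates each row into a Map keyed by date; the row range 3 to 7 is an assumption about the table layout, not something verified against the page:

const priceByDate = new Map()

// Assumed row range: table.type2 on this page mixes data and separator rows
for (let i = 3; i <= 7; i++) {
    const date = await frame.$eval(
        `body > table.type2 > tbody > tr:nth-child(${i}) > td:nth-child(1) > span`,
        (ele) => ele.textContent.trim()
    )
    const price = await frame.$eval(
        `body > table.type2 > tbody > tr:nth-child(${i}) > td:nth-child(2) > span`,
        (ele) => ele.textContent.trim()
    )
    priceByDate.set(date, price)
}

await browser.close()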

In this demo, we used the cheerio and puppeteer packages to extract large amounts of information from the HTML of webpages. Although the extracted data can be useful, there are real ethical concerns (related to data ownership and generating excessive traffic). Always be cautious when crawling external data for your own use.
