⬇️ Main Note
https://docs.google.com/document/d/1IZ5yYEtX92E7k2ijoAZZB3W_nBG9MpGPX6OKk_POxLQ/edit
Inspect / developer tools command: Command + Option + I
There is an <em> tag in the Elements panel.
--> Fetching the data is scraping; what to do with that data afterwards is up to the developer.
</> => HyperText Markup Language (HTML)
Custom tags such as <Writer/>, <School/>, etc. — a format like <Name>JB</Name> was used.
GET https://naver.com : able to get the page's elements (the HTML) as data.
Cheerio takes that HTML string and lets us pick out the tags. [tool]
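A minimal sketch of this flow, assuming axios for the GET request (the page and the "title" selector are only examples):
import axios from "axios"
import cheerio from "cheerio"

async function getTitle(){
  const response = await axios.get("https://naver.com") // GET the page => response.data is the HTML as one long string
  const $ = cheerio.load(response.data)                  // cheerio parses that string so tags can be selected
  console.log($("title").text())                         // prints the text inside the page's <title> tag
}
getTitle()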
When we send a link on certain sites, for example Discord, a preview image and title pop up in the link box.
When a site is built, meta tags with og properties are added inside the head tag. Discord's developers build the preview on top of these tags.
--> creating the link preview
og was created by Facebook, which first wanted to build link previews. og stands for Open Graph.
If I'm creating my own site and the site address is mysite.com, the meta tags should first be added in the head tag:
<meta property="og:title" content="..." />, <meta property="og:image" content="..." />
import axios from "axios"
import cheerio from "cheerio"

async function createBoardAPI(mydata){ // mydata <== the frontendData passed in
  const targetUrl = mydata.contents.split(" ").filter((el) => el.startsWith("http"))[0]
  // splitting on spaces cuts the contents into an array, one word per element
  // => from that array, keep only the words that start with http
  // so the final result is an array holding only the "starts with http" part
  // take index 0 of that array to pull out just the address cleanly
  const aaa = await axios.get(targetUrl)
  const $ = cheerio.load(aaa.data)
  $("meta").each((_, el) => { // only the meta tags are picked out => .each : works like a for loop (runs for every meta tag)
    // _ : the index of the meta tag // el = element => e.g. for the 3rd one, el carries the 3rd meta tag's content
    // what we need are the meta tags that contain og:
    // $ is what controls a specific tag
    if ($(el).attr('property')){ // looping over every single meta tag is inefficient, so filter with an if statement
      const key = $(el).attr('property').split(":")[1] // find the tags whose property attribute carries the og: value
      // ==> split(":") --> splits on ":" into ['og', 'title']; here title sits at index 1
      // title --> key, "NAVER" --> value
      const value = $(el).attr('content') // the word "NAVER" comes out
      console.log(key, value)
    }
  })
}

const frontendData = { // when a post is registered from the frontend, the content below is registered:
  title: "Hi there, this is my title",
  contents: "The weather's nice today. I want you guys to visit this site: https://naver.com ~"
}
createBoardAPI(frontendData)
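A minimal sketch of where this heads next, under the assumption that the og values should be gathered into one object (for example, to store together with the board) instead of only being logged; the helper name fetchOpenGraph and the object shape are illustrative, not from the lecture:
async function fetchOpenGraph(targetUrl){ // hypothetical helper, same axios + cheerio setup as above
  const { data } = await axios.get(targetUrl)
  const $ = cheerio.load(data)
  const og = {} // ends up like { title: "...", image: "...", ... }
  $("meta").each((_, el) => {
    const property = $(el).attr("property")
    if (property && property.startsWith("og:")) {
      og[property.split(":")[1]] = $(el).attr("content")
    }
  })
  return og
}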
onclick is an attribute.
property is also an attribute: <meta property="og:title" />
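Since property is just an attribute, cheerio's CSS attribute selectors can match on it directly, replacing the if-check above; a sketch assuming the same $ = cheerio.load(aaa.data) from the example above:
$('meta[property^="og:"]').each((_, el) => { // ^= matches property values that start with "og:"
  console.log($(el).attr("property").split(":")[1], $(el).attr("content"))
})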
When scraping happens constantly, that becomes crawling.
When I want to open a browser and do something in it, Puppeteer is used. [tool]
// a real-world crawling case (여기어때): https://biz.chosun.com/topics/law_firm/2021/09/29/OOBWHWT5ZBF7DESIRKNPYIODLA/
// if crawling requests are fired indiscriminately, it is like a flood of visitors: a lot of memory is needed => and then more computers (servers) become necessary
import puppeteer from 'puppeteer'

async function startCrawling(){ // each step has to be awaited one by one (open the browser, open the page, ...)
  const browser = await puppeteer.launch({headless: false}) // the browser window appears
  const page = await browser.newPage() // open a new page (tab)
  await page.setViewport({width: 1280, height: 720}) // the page size can also be set
  await page.goto("https://www.goodchoice.kr/product/search/2") // navigates there in the Chromium browser // Chrome is a browser built on top of Chromium (the two are slightly different)
  await page.waitForTimeout(1000) // connect, give it a moment, then continue
  const star = await page.$eval("#poduct_list_area > li:nth-child(2) > a > div > div.name > div > span", (el) => el.textContent)
  // $eval is for one element, $$eval is for selecting several // '>' => a direct child tag // the span is a child of the div
  // only the number in nth-child() changes => another hotel's star rating: #poduct_list_area > li:nth-child(3) > a > div > div.name > div > span
  // => so looping with a for statement can fetch every item's data (see the sketch after this block)
  await page.waitForTimeout(1000)
  const location = (await page.$eval("#poduct_list_area > li:nth-child(2) > a > div > div.name > p:nth-child(4)", (el) => el.textContent)).trim()
  await page.waitForTimeout(1000)
  const price = await page.$eval("#poduct_list_area > li:nth-child(2) > a > div > div.price > p > b", (el) => el.textContent)
  await page.waitForTimeout(1000)
  console.log("⭐️ star:", star)
  console.log("📍 location:", location)
  console.log("💳 Price:", price)
  await browser.close() // close the browser once crawling is done
}
startCrawling()
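A sketch of the "only the nth-child() number changes" idea from the comments above, using $$eval; the assumption is that dropping the li:nth-child(2) part makes the selector match the same span inside every list item:
async function getAllStars(page){ // expects the same page object used in startCrawling
  const stars = await page.$$eval(
    "#poduct_list_area > li > a > div > div.name > div > span",
    (els) => els.map((el) => el.textContent.trim())
  )
  console.log(stars) // one entry per item in the list
}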