Spring: A Quick Guide to Crawling with Selenium

이진우 · August 2, 2023

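What is crawling?

Crawling is the process of automatically navigating web pages and extracting information from them.
If you need data that no API provides, but the data is published on a website, a crawler can extract it so that you can store it in a DB or print it.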

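The crawling process

Request the web page: the crawler requests the URL of a specific web page.

Download the web page: the content at the requested URL is downloaded and handed to the crawler.

Extract data: the required data is extracted from the downloaded page, using the HTML class, id, tag, and so on.

Store and process data: the extracted data is converted into the required format, saved to a database, or otherwise processed.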
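
The four steps above can be sketched with nothing but the JDK (Java 11 or later). This is only an illustration, not the Selenium approach used in the rest of this post: the class name CrawlSketch, its method names, and the sample HTML are assumptions made for the example, and regex-based extraction is only adequate for very simple markup.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and methods are illustrative only.
class CrawlSketch {

    // Steps 1 and 2: request the URL and download the page body as a string.
    static String download(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Step 3: collect the text of every element whose class attribute is exactly className.
    static List<String> extractByClass(String html, String className) {
        Pattern pattern = Pattern.compile(
                "<(\\w+)[^>]*class=\"" + Pattern.quote(className) + "\"[^>]*>(.*?)</\\1>",
                Pattern.DOTALL);
        Matcher matcher = pattern.matcher(html);
        List<String> result = new ArrayList<>();
        while (matcher.find()) {
            result.add(matcher.group(2).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // A hard-coded page stands in for download(...) so the sketch runs offline.
        String html = "<div class=\"tier\">Gold 2</div><span class=\"tier\">Silver 1</span>";
        // Step 4: process or store the extracted values; here they are simply printed.
        extractByClass(html, "tier").forEach(System.out::println);
    }
}
```

A plain HTTP download like this only sees the HTML the server sends. Pages that build their content with JavaScript need a real browser, which is where Selenium comes in.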

Crawling with Selenium in Spring

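First, check the version of your Chrome browser.

Open the Chrome menu and go to Settings.

In Settings, open "About Chrome" and the version is shown there.

Now go to the site below
https://chromedriver.chromium.org/downloads
and download the ChromeDriver that matches your version. For Chrome 115 and later, that page directs you to the Chrome for Testing availability dashboard for the download.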

Add to build.gradle

implementation group: 'org.seleniumhq.selenium', name: 'selenium-java', version: '4.10.0'

Add the Component

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.springframework.stereotype.Component;

import java.util.ArrayList;
import java.util.List;

@Component
public class CrawlingExample {
    private WebDriver webDriver;
    // Page to crawl
    private static final String url = "https://www.op.gg/summoners/kr/jugklng";

    // Path to chromedriver.exe; each backslash is escaped as \\
    private String crawlingDriver = "C:\\Users\\LG PC\\Desktop\\chromedriver-win32\\chromedriver-win32\\chromedriver.exe";

    public void process() {
        System.setProperty("webdriver.chrome.driver", crawlingDriver);
        ChromeOptions chromeOptions = new ChromeOptions();
        chromeOptions.addArguments("--remote-allow-origins=*");
        webDriver = new ChromeDriver(chromeOptions);
        try {
            getData();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
            e.printStackTrace();
        } finally {
            // quit() closes every window and ends the session, so a separate close() is not needed
            webDriver.quit();
        }
    }

    private List<String> getData() throws InterruptedException {
        List<String> list = new ArrayList<>();
        webDriver.get(url);
        Thread.sleep(1000); // fixed wait for the page to finish loading
        List<WebElement> sentence = webDriver.findElements(By.className("tier"));
        for (WebElement webElement : sentence) {
            // Read the current element (the original always read sentence.get(0)) and collect it
            String text = webElement.getText();
            System.out.println(text);
            list.add(text);
        }
        return list;
    }
}
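
The driver path in the class above is written with doubled backslashes. The sketch below shows why that matters in a Java string literal. The class name DriverPathDemo and the C:\tools path are made up for the example and are not part of the project.

```java
// Hypothetical demo class; the paths are made-up examples.
class DriverPathDemo {

    // Each \\ in the source is a single backslash at runtime.
    static final String ESCAPED = "C:\\tools\\chromedriver.exe";

    // Forward slashes need no escaping and are generally accepted by Java file APIs on Windows.
    static final String FORWARD = "C:/tools/chromedriver.exe";

    // Pitfall: a lone backslash starts an escape sequence, so \t here is a TAB character.
    static final String BROKEN = "C:\tools";

    public static void main(String[] args) {
        System.out.println(ESCAPED);                                    // C:\tools\chromedriver.exe
        System.out.println(ESCAPED.replace('\\', '/').equals(FORWARD)); // true
        System.out.println(BROKEN.contains("\t"));                      // true
    }
}
```

The BROKEN case compiles without any warning, which is what makes this mistake easy to miss. Other letters after a single backslash, such as the U in C:\Users, are rejected by the compiler instead.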

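1) url: the URL of the page that contains the content you want to extract.

2) crawlingDriver: the path to chromedriver.exe.

Be careful to tell /, \, and \\ apart. In a Java string literal a Windows path needs either \\ or /, because a single \ starts an escape sequence.

3) System.setProperty("web... -> sets the path of the Chrome WebDriver.

4) ChromeOptions creation -> creates the ChromeOptions object, exactly as the code reads.

5) addArguments("--remote-allow-origins=*") -> without this,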

java.io.IOException: Invalid Status code=403 text=Forbidden
	at org.asynchttpclient.netty.handler.WebSocketHandler.abort(WebSocketHandler.java:92) ~[async-http-client-2.12.3.jar:na]
	at org.asynchttpclient.netty.handler.WebSocketHandler.handleRead(WebSocketHandler.java:118) ~[async-http-client-2.12.3.jar:na]
	at org.asynchttpclient.netty.handler.AsyncHttpClientHandler.channelRead(AsyncHttpClientHandler.java:78) ~[async-http-client-2.12.3.jar:na]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:436) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) ~[netty-codec-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:333) ~[netty-codec-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:454) ~[netty-codec-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) ~[netty-codec-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.94.Final.jar:4.1.94.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.94.Final.jar:4.1.94.Final]
	at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]

2023-08-02 22:25:52.221 ERROR 4264 --- [nio-8080-exec-2] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is org.openqa.selenium.remote.http.ConnectionFailedException: Unable to establish websocket connection to http://localhost:64263/devtools/browser/bfee20be-0351-472f-859e-c61fe5f22520
Build info: version: '4.1.4', revision: '535d840ee2'
System info: host: 'USER', ip: '192.168.0.13', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '11.0.2'
Driver info: driver.version: ChromeDriver] with root cause

org.openqa.selenium.remote.http.ConnectionFailedException: Unable to establish websocket connection to http://localhost:64263/devtools/browser/bfee20be-0351-472f-859e-c61fe5f22520
Build info: version: '4.1.4', revision: '535d840ee2'
System info: host: 'USER', ip: '192.168.0.13', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '11.0.2'
Driver info: driver.version: ChromeDriver
	at org.openqa.selenium.remote.http.netty.NettyWebSocket.<init>(NettyWebSocket.java:102) ~[selenium-remote-driver-4.1.4.jar:na]
	at org.openqa.selenium.remote.http.netty.NettyWebSocket.lambda$create$3(NettyWebSocket.java:128) ~[selenium-remote-driver-4.1.4.jar:na]
	at org.openqa.selenium.remote.http.netty.NettyClient.openSocket(NettyClient.java:105) ~[selenium-remote-driver-4.1.4.jar:na]
	at org.openqa.selenium.devtools.Connection.<init>(Connection.java:77) ~[selenium-remote-driver-4.1.4.jar:na]
	at org.openqa.selenium.chromium.ChromiumDriver.lambda$new$2(ChromiumDriver.java:124) ~[selenium-chromium-driver-4.1.4.jar:na]
	at java.base/java.util.Optional.map(Optional.java:265) ~[na:na]
	at org.openqa.selenium.chromium.ChromiumDriver.<init>(ChromiumDriver.java:122) ~[selenium-chromium-driver-4.1.4.jar:na]
	at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:106) ~[selenium-chrome-driver-4.1.4.jar:na]
	at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:93) ~[selenium-chrome-driver-4.1.4.jar:na]
	at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:48) ~[selenium-chrome-driver-4.1.4.jar:na]
	at com.crawling.crawlingpractice.component.CrawlingExample.process(CrawlingExample.java:24) ~[main/:na]
	at com.crawling.crawlingpractice.Service.CrawlingService.doCrawling(CrawlingService.java:13) ~[main/:na]
	at com.crawling.crawlingpractice.Controller.CrawlingController.showCrawling(CrawlingController.java:18) ~[main/:na]
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:na]
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:na]
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:na]
	at java.base/java.lang.reflect.Method.invoke(Method.java:566) ~[na:na]
	at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:205) ~[spring-web-5.3.29.jar:5.3.29]
	at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:150) ~[spring-web-5.3.29.jar:5.3.29]
	at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:117) ~[spring-webmvc-5.3.29.jar:5.3.29]
	at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:895) ~[spring-webmvc-5.3.29.jar:5.3.29]
	at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:808) ~[spring-webmvc-5.3.29.jar:5.3.29]
	at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87) ~[spring-webmvc-5.3.29.jar:5.3.29]
	at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1072) ~[spring-webmvc-5.3.29.jar:5.3.29]
	at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:965) ~[spring-webmvc-5.3.29.jar:5.3.29]
	at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1006) ~[spring-webmvc-5.3.29.jar:5.3.29]
	at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:898) ~[spring-webmvc-5.3.29.jar:5.3.29]
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:529) ~[tomcat-embed-core-9.0.78.jar:4.0.FR]
	at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:883) ~[spring-webmvc-5.3.29.jar:5.3.29]
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:623) ~[tomcat-embed-core-9.0.78.jar:4.0.FR]
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:209) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:153) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:51) ~[tomcat-embed-websocket-9.0.78.jar:9.0.78]
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:178) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:153) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100) ~[spring-web-5.3.29.jar:5.3.29]
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:117) ~[spring-web-5.3.29.jar:5.3.29]
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:178) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:153) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93) ~[spring-web-5.3.29.jar:5.3.29]
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:117) ~[spring-web-5.3.29.jar:5.3.29]
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:178) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:153) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201) ~[spring-web-5.3.29.jar:5.3.29]
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:117) ~[spring-web-5.3.29.jar:5.3.29]
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:178) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:153) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:167) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:90) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:481) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:130) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:93) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:390) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:63) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:926) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1791) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:52) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
	at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]
   

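an error like the one above can occur. Note that the trace reports Selenium 4.1.4 rather than the 4.10.0 declared in build.gradle, so an older version appears to have been resolved at build time.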

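6) getData() is the method that actually fetches the data.

7) webDriver.get(url) makes Selenium open the web page at the given URL.

8) Thread.sleep(1000) is needed because the page must finish loading before parsing returns correct results. It is a fixed delay, so it may be too short for a slow page.

9) List<WebElement> sentence = webDriver.findElements(By.className("tier")); finds every element whose class is "tier".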
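
A fixed Thread.sleep(1000) either waits longer than necessary or not long enough. The usual alternative is to poll for a condition until a timeout expires, and Selenium provides WebDriverWait for that. The sketch below shows the same idea without any library so that it runs on its own. Poller and waitUntil are hypothetical names, not Selenium APIs.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical helper, not part of Selenium; it only illustrates poll-until-timeout.
class Poller {

    // Calls condition repeatedly until it returns a non-null value or the timeout elapses.
    static <T> T waitUntil(Supplier<T> condition, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            T value = condition.get();
            if (value != null) {
                return value; // condition met: stop waiting immediately
            }
            if (System.nanoTime() >= deadline) {
                throw new IllegalStateException("Condition not met within " + timeout);
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in for "the element is present": becomes non-null after roughly 300 ms.
        String result = waitUntil(
                () -> System.currentTimeMillis() - start >= 300 ? "loaded" : null,
                Duration.ofSeconds(2), Duration.ofMillis(50));
        System.out.println(result);
    }
}
```

In the crawler itself, the Selenium counterpart would be along the lines of new WebDriverWait(webDriver, Duration.ofSeconds(10)).until(ExpectedConditions.presenceOfAllElementsLocatedBy(By.className("tier"))), which returns as soon as the elements appear instead of always sleeping for a full second. The 10-second timeout is an arbitrary choice for illustration.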

Want to crawl faster?

        chromeOptions.addArguments("--remote-allow-origins=*");
        chromeOptions.addArguments("--disable-popup-blocking");             // turn off Chrome's popup blocker
        chromeOptions.addArguments("--headless");                           // run without a visible browser window
        chromeOptions.addArguments("--disable-gpu");                        // disable GPU acceleration
        chromeOptions.addArguments("--blink-settings=imagesEnabled=false"); // do not load images

Just add options like these to ChromeOptions.

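Service: CrawlingService

import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
public class CrawlingService {
    // Injected through the constructor that Lombok generates
    private final CrawlingExample crawlingExample;

    public void doCrawling() {
        crawlingExample.process();
    }
}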

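Controller: CrawlingController

import lombok.RequiredArgsConstructor;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequiredArgsConstructor
@RequestMapping("/crawling")
public class CrawlingController {
    private final CrawlingService crawlingService;

    // GET /crawling/show starts one crawl; the response body is empty
    @GetMapping("/show")
    public void showCrawling() {
        crawlingService.doCrawling();
    }
}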
  

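Testing with Postman

Send a GET request to /crawling/show (http://localhost:8080/crawling/show on the default port), and the crawled tier text is printed to the application console as expected.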
