웹 페이지 다운로드: 요청한 URL의 웹 페이지 내용이 다운로드, 크롤러에게 전달
데이터 추출: 다운로드된 웹 페이지에서 필요한 데이터를 추출. html의 class,id,tag 등을 활용해서 추출할 수 있다.
데이터 저장 및 처리: 추출한 데이터는 필요한 형식으로 가공, 데이터베이스에 저장 등 여러 작업 수행
일단 크롬 브라우저의 버젼을 먼저 확인해야 한다.
여기서 설정을 들어간다.
설정에서 Chrome 정보에 들어가면
버젼을 쉽게 확인이 가능하다.
이제 아래 사이트에 접속해서
https://chromedriver.chromium.org/downloads
버젼에 맞는 Chromedriver를 다운받아 둔다.
implementation group: 'org.seleniumhq.selenium', name: 'selenium-java', version: '4.10.0'
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;
import java.util.ArrayList;
import java.util.List;
@Component
public class CrawlingExample {
private WebDriver webDriver;
private static final String url="https://www.op.gg/summoners/kr/jugklng";
private String crawlingDriver="C:\\Users\\LG PC\\Desktop\\chromedriver-win32\\chromedriver-win32\\chromedriver.exe";
public void process(){
System.setProperty("webdriver.chrome.driver",crawlingDriver);
ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.addArguments("--remote-allow-origins=*");
webDriver=new ChromeDriver(chromeOptions);
try{
getData();
}catch (InterruptedException e){
e.printStackTrace();
}
webDriver.close();
webDriver.quit();
}
private List<String> getData() throws InterruptedException{
List<String> list=new ArrayList<>();
webDriver.get(url);
Thread.sleep(1000);
List<WebElement> sentence=webDriver.findElements(By.className("tier"));
for (WebElement webElement : sentence) {
System.out.println(sentence.get(0).getText());
}
return list;
}
}
1)url:우리가 추출하고 싶은 내용이 있는 url을 입력
2)crawlingDriver:chromedriver.exe 가 있는 경로를 입력해야 한다.
주의 해야 할점은 /인지.\인지 \인지 열심히 구분해야 한다.
3)System.setProperty("web...-->크롬 웹 드라이버의 경로를 설정
4)ChromeOptions 생성 코드->코드 그대로 ChromeOptions를 생성한다
5)addArguments("--remote-allow-origins=*")->이게 없으면
java.io.IOException: Invalid Status code=403 text=Forbidden
at org.asynchttpclient.netty.handler.WebSocketHandler.abort(WebSocketHandler.java:92) ~[async-http-client-2.12.3.jar:na]
at org.asynchttpclient.netty.handler.WebSocketHandler.handleRead(WebSocketHandler.java:118) ~[async-http-client-2.12.3.jar:na]
at org.asynchttpclient.netty.handler.AsyncHttpClientHandler.channelRead(AsyncHttpClientHandler.java:78) ~[async-http-client-2.12.3.jar:na]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:436) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) ~[netty-codec-4.1.94.Final.jar:4.1.94.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:333) ~[netty-codec-4.1.94.Final.jar:4.1.94.Final]
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:454) ~[netty-codec-4.1.94.Final.jar:4.1.94.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) ~[netty-codec-4.1.94.Final.jar:4.1.94.Final]
at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[netty-transport-4.1.94.Final.jar:4.1.94.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.94.Final.jar:4.1.94.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.94.Final.jar:4.1.94.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.94.Final.jar:4.1.94.Final]
at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]
2023-08-02 22:25:52.221 ERROR 4264 --- [nio-8080-exec-2] o.a.c.c.C.[.[.[/].[dispatcherServlet] : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is org.openqa.selenium.remote.http.ConnectionFailedException: Unable to establish websocket connection to http://localhost:64263/devtools/browser/bfee20be-0351-472f-859e-c61fe5f22520
Build info: version: '4.1.4', revision: '535d840ee2'
System info: host: 'USER', ip: '192.168.0.13', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '11.0.2'
Driver info: driver.version: ChromeDriver] with root cause
org.openqa.selenium.remote.http.ConnectionFailedException: Unable to establish websocket connection to http://localhost:64263/devtools/browser/bfee20be-0351-472f-859e-c61fe5f22520
Build info: version: '4.1.4', revision: '535d840ee2'
System info: host: 'USER', ip: '192.168.0.13', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '11.0.2'
Driver info: driver.version: ChromeDriver
at org.openqa.selenium.remote.http.netty.NettyWebSocket.<init>(NettyWebSocket.java:102) ~[selenium-remote-driver-4.1.4.jar:na]
at org.openqa.selenium.remote.http.netty.NettyWebSocket.lambda$create$3(NettyWebSocket.java:128) ~[selenium-remote-driver-4.1.4.jar:na]
at org.openqa.selenium.remote.http.netty.NettyClient.openSocket(NettyClient.java:105) ~[selenium-remote-driver-4.1.4.jar:na]
at org.openqa.selenium.devtools.Connection.<init>(Connection.java:77) ~[selenium-remote-driver-4.1.4.jar:na]
at org.openqa.selenium.chromium.ChromiumDriver.lambda$new$2(ChromiumDriver.java:124) ~[selenium-chromium-driver-4.1.4.jar:na]
at java.base/java.util.Optional.map(Optional.java:265) ~[na:na]
at org.openqa.selenium.chromium.ChromiumDriver.<init>(ChromiumDriver.java:122) ~[selenium-chromium-driver-4.1.4.jar:na]
at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:106) ~[selenium-chrome-driver-4.1.4.jar:na]
at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:93) ~[selenium-chrome-driver-4.1.4.jar:na]
at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:48) ~[selenium-chrome-driver-4.1.4.jar:na]
at com.crawling.crawlingpractice.component.CrawlingExample.process(CrawlingExample.java:24) ~[main/:na]
at com.crawling.crawlingpractice.Service.CrawlingService.doCrawling(CrawlingService.java:13) ~[main/:na]
at com.crawling.crawlingpractice.Controller.CrawlingController.showCrawling(CrawlingController.java:18) ~[main/:na]
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:na]
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:na]
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:na]
at java.base/java.lang.reflect.Method.invoke(Method.java:566) ~[na:na]
at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:205) ~[spring-web-5.3.29.jar:5.3.29]
at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:150) ~[spring-web-5.3.29.jar:5.3.29]
at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:117) ~[spring-webmvc-5.3.29.jar:5.3.29]
at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:895) ~[spring-webmvc-5.3.29.jar:5.3.29]
at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:808) ~[spring-webmvc-5.3.29.jar:5.3.29]
at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87) ~[spring-webmvc-5.3.29.jar:5.3.29]
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1072) ~[spring-webmvc-5.3.29.jar:5.3.29]
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:965) ~[spring-webmvc-5.3.29.jar:5.3.29]
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1006) ~[spring-webmvc-5.3.29.jar:5.3.29]
at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:898) ~[spring-webmvc-5.3.29.jar:5.3.29]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:529) ~[tomcat-embed-core-9.0.78.jar:4.0.FR]
at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:883) ~[spring-webmvc-5.3.29.jar:5.3.29]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:623) ~[tomcat-embed-core-9.0.78.jar:4.0.FR]
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:209) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:153) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:51) ~[tomcat-embed-websocket-9.0.78.jar:9.0.78]
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:178) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:153) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100) ~[spring-web-5.3.29.jar:5.3.29]
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:117) ~[spring-web-5.3.29.jar:5.3.29]
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:178) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:153) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93) ~[spring-web-5.3.29.jar:5.3.29]
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:117) ~[spring-web-5.3.29.jar:5.3.29]
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:178) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:153) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201) ~[spring-web-5.3.29.jar:5.3.29]
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:117) ~[spring-web-5.3.29.jar:5.3.29]
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:178) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:153) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:167) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:90) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:481) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:130) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:93) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:390) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:63) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:926) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1791) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:52) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) ~[tomcat-embed-core-9.0.78.jar:9.0.78]
at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]
이런 오류 코드가 날 수 있다.
6)getData가 실질적으로 데이터를 가져올 수 있는 코드이다
7) getUrl은 셀레니움이 지정된 URL의 웹 페이지를 자동으로 열어주는 역할을 한다
8) Thread.sleep(1000)은 페이지가 모두 로딩될때까지 기다려야 파싱 결과를 정상적으로 얻어올 수 있기에 필요한 코드이다.
9)List sentence=webDriver.findElements(By.className("tier"));는 "tier"라는 클래스 요소로 element를 찾는 코드이다.
chromeOptions.addArguments("--remote-allow-origins=*");
chromeOptions.addArguments("--disable-popup-blocking"); //팝업안띄움
chromeOptions.addArguments("headless"); //브라우저 안띄움
chromeOptions.addArguments("--disable-gpu"); //gpu 비활성화
chromeOptions.addArguments("--blink-settings=imagesEnabled=false"); //이미지 다운 안받음
이렇게 옵션에 여러가지 설정을 추가하여 주면 된다!
@Service
@RequiredArgsConstructor
public class CrawlingService {
private final CrawlingExample crawlingExample;
public void doCrawling(){
crawlingExample.process();
}
}
@RestController
@RequiredArgsConstructor
@RequestMapping("/crawling")
public class CrawlingController {
private final CrawlingService crawlingService;
@GetMapping("/show")
public void showCrawling(){
crawlingService.doCrawling();
}
}
위의 URL에 요청을 보내면
이렇게 출력이 잘 되는 것을 볼 수 있다.