I need to scrape data from a web page that loads its content dynamically.
I decided to implement the scraping with Selenium.
dependencies {
    ...
    implementation 'org.seleniumhq.selenium:selenium-java:4.31.0'
}
Find the build that matches your Chrome version and OS at the link below and download it.
https://github.com/GoogleChromeLabs/chrome-for-testing/blob/main/data/latest-versions-per-milestone-with-downloads.json
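The JSON file linked above lists download URLs per Chrome milestone. As a rough sketch of how the right URL could be picked out of it, the helper below walks a structure shaped like that file. The function name find_download_url and the sample dictionary are illustrative assumptions: the sample is a hand-written, trimmed fragment, not the real file contents.

```python
from typing import Optional


def find_download_url(data: dict, milestone: str, binary: str, platform: str) -> Optional[str]:
    """Return the download URL for a binary ("chrome" or "chromedriver") on a platform.

    Assumes the layout milestones -> <milestone> -> downloads -> <binary> -> [{platform, url}].
    Returns None when the milestone, binary, or platform is not present.
    """
    entry = data.get("milestones", {}).get(milestone)
    if entry is None:
        return None
    for download in entry.get("downloads", {}).get(binary, []):
        if download.get("platform") == platform:
            return download.get("url")
    return None


# Hand-written sample for illustration only; the real file has many more entries.
sample = {
    "milestones": {
        "136": {
            "milestone": "136",
            "version": "136.0.7103.92",
            "downloads": {
                "chromedriver": [
                    {
                        "platform": "linux64",
                        "url": "https://storage.googleapis.com/chrome-for-testing-public/136.0.7103.92/linux64/chromedriver-linux64.zip",
                    }
                ]
            },
        }
    }
}

url = find_download_url(sample, "136", "chromedriver", "linux64")
```

In practice the dictionary would come from json.load() on the downloaded file, and the milestone should match the major version of the installed Chrome.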
After downloading, unzip the archive and copy the path to the chromedriver executable.
That path then has to be made available to the Spring application so that it can start the WebDriver.
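import lombok.RequiredArgsConstructor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.core.env.Environment;
import org.springframework.stereotype.Component;

@Component
@RequiredArgsConstructor
public class ScraperUtil {

    // Path to the chromedriver executable, read from application.yaml
    @Value("${scraper.web-driver-path}")
    private String WEB_DRIVER_PATH;

    private final Environment environment;

    public WebDriver getWebDriver() {
        // Tell Selenium where the chromedriver executable lives
        System.setProperty("webdriver.chrome.driver", WEB_DRIVER_PATH);

        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");
        options.addArguments("--no-sandbox");
        options.addArguments("--disable-gpu");
        options.addArguments("--disable-extensions");
        options.addArguments("--disable-dev-shm-usage");
        return new ChromeDriver(options);
    }
}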
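The path is set under scraper.web-driver-path in application.yaml and then registered as a JVM system property with System.setProperty().
After the arguments are added to ChromeOptions, the method returns a ChromeDriver, which is the WebDriver the scraper will use.

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;
import java.util.Objects;
import lombok.RequiredArgsConstructor;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.springframework.stereotype.Component;

@Component
@RequiredArgsConstructor
public class Scraper {

    private final ScraperUtil scraperUtil;
    private final ObjectMapper objectMapper;

    private static final String URL = "https://www.jobkorea.co.kr/recruit/joblist?menucode=duty";

    // Start a browser session and open the page to scrape
    private WebDriver initWebDriver() {
        WebDriver driver = scraperUtil.getWebDriver();
        driver.get(URL);
        return driver;
    }

    public List<RecruitResponseDto> scrapRecruits() {
        WebDriver driver = initWebDriver();
        try {
            List<WebElement> recruits = driver.findElements(By.cssSelector(".job.circleType.dev-tab.dev-duty .nano-content.dev-main .item"));
            return recruits.stream()
                    .map(recruit -> recruit.getAttribute("data-value-json"))
                    .filter(Objects::nonNull)
                    .map(jsonString -> {
                        try {
                            return objectMapper.readValue(jsonString, RecruitResponseDto.class);
                        } catch (JsonProcessingException e) {
                            throw new RuntimeException(e);
                        }
                    })
                    .toList();
        } finally {
            // Always close the browser so Chrome processes do not leak
            driver.quit();
        }
    }
}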
initWebDriver() sets the URL to scrape.
The scrapRecruits() method finds the required elements and uses objectMapper to parse them into Java objects, which it returns.

apt-get update && apt-get install -y unzip
curl -Lo "/tmp/chromedriver-linux64.zip" "https://storage.googleapis.com/chrome-for-testing-public/136.0.7103.92/linux64/chromedriver-linux64.zip"
curl -Lo "/tmp/chrome-linux64.zip" "https://storage.googleapis.com/chrome-for-testing-public/136.0.7103.92/linux64/chrome-linux64.zip"
unzip /tmp/chromedriver-linux64.zip -d /opt/
unzip /tmp/chrome-linux64.zip -d /opt/
Error starting ApplicationContext. To display the condition evaluation report re-run your application with 'debug' enabled.
2025-05-07T18:36:46.285+09:00 ERROR 1 --- [xpact] [ main] o.s.boot.SpringApplication : Application run failed
org.openqa.selenium.remote.NoSuchDriverException: Unable to obtain: chromedriver, error chromedriver must exist: /home/ubuntu/chromedriver-linux64/chromedriver
For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors/driver_location/
Build info: version: '4.25.0', revision: '8a8aea2337'
chromedriver --port=9000
ChromeOptions options = new ChromeOptions();
options.addArguments("--headless");
options.addArguments("--no-sandbox");
options.addArguments("--disable-gpu");
WebDriver driver = new RemoteWebDriver(
        new URL("http://127.0.0.1:9000"),
        options
);
The idea here was to use the WebDriver as a RemoteWebDriver.
There is an image called selenium/standalone-chrome that bundles Chrome and the WebDriver into a single package, so I tried it.

  selenium:
    image: selenium/standalone-chrome:latest
    container_name: selenium
    ports:
      - "4444:4444"
    shm_size: 512mb
    networks:
      - mynetwork
    restart: "no"
Caused by: org.openqa.selenium.TimeoutException: java.util.concurrent.TimeoutException
Build info: version: '4.25.0', revision: '8a8aea2337'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.8.0-1024-aws', java.version: '17.0.2'
Driver info: driver.version: RemoteWebDriver
at org.openqa.selenium.remote.http.jdk.JdkHttpClient.execute0(JdkHttpClient.java:418) ~[selenium-http-4.25.0.jar!/:na]
at org.openqa.selenium.remote.http.AddSeleniumUserAgent.lambda$apply$0(AddSeleniumUserAgent.java:42) ~[selenium-http-4.25.0.jar!/:na]
at org.openqa.selenium.remote.http.Filter.lambda$andFinally$1(Filter.java:55) ~[selenium-http-4.25.0.jar!/:na]
at org.openqa.selenium.remote.http.jdk.JdkHttpClient.execute(JdkHttpClient.java:374) ~[selenium-http-4.25.0.jar!/:na]
at org.openqa.selenium.remote.tracing.TracedHttpClient.execute(TracedHttpClient.java:54) ~[selenium-remote-driver-4.25.0.jar!/:na]
at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:89) ~[selenium-remote-driver-4.25.0.jar!/:na]
at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:75) ~[selenium-remote-driver-4.25.0.jar!/:na]
at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:61) ~[selenium-remote-driver-4.25.0.jar!/:na]
at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:162) ~[selenium-remote-driver-4.25.0.jar!/:na]
at org.openqa.selenium.remote.TracedCommandExecutor.execute(TracedCommandExecutor.java:53) ~[selenium-remote-driver-4.25.0.jar!/:na]
at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:545) ~[selenium-remote-driver-4.25.0.jar!/:na]
... 36 common frames omitted
Caused by: java.util.concurrent.TimeoutException: null
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095) ~[na:na]
at org.openqa.selenium.remote.http.jdk.JdkHttpClient.execute0(JdkHttpClient.java:401) ~[selenium-http-4.25.0.jar!/:na]
... 46 common frames omitted
Caused by: org.openqa.selenium.SessionNotCreatedException: Could not start a new session. Response code 500. Message: Could not start a new session. Error while creating session with the driver service. Stopping driver service: Could not start a new session. Response code 500. Message: session not created: DevToolsActivePort file doesn't exist
Host info: host: '867a2996bab5', ip: '172.20.0.4'
Build info: version: '4.32.0', revision: 'd17c8aa950'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.8.0-1028-aws', java.version: '21.0.6'
Driver info: driver.version: unknown
Build info: version: '4.32.0', revision: 'd17c8aa950'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.8.0-1028-aws', java.version: '21.0.6'
Driver info: driver.version: unknown
Build info: version: '4.25.0', revision: '8a8aea2337'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.8.0-1028-aws', java.version: '17.0.2'
Driver info: org.openqa.selenium.remote.RemoteWebDriver
Command: [null, newSession {capabilities=[Capabilities {browserName: chrome, goog:chromeOptions: {args: [--headless=new, --no-sandbox, --disable-gpu, --disable-extensions, --disable-dev-shm-usage, --disable-extensions, --remote-allow-origins=*], extensions: []}}]}]
Capabilities {browserName: chrome, goog:chromeOptions: {args: [--headless=new, --no-sandbox, --disable-gpu, --disable-extensions, --disable-dev-shm-usage, --disable-extensions, --remote-allow-origins=*], extensions: []}}
at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:114) ~[selenium-remote-driver-4.25.0.jar!/:na]
at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:75) ~[selenium-remote-driver-4.25.0.jar!/:na]
at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:61) ~[selenium-remote-driver-4.25.0.jar!/:na]
at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:162) ~[selenium-remote-driver-4.25.0.jar!/:na]
at org.openqa.selenium.remote.TracedCommandExecutor.execute(TracedCommandExecutor.java:53) ~[selenium-remote-driver-4.25.0.jar!/:na]
at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:545) ~[selenium-remote-driver-4.25.0.jar!/:na]
at org.openqa.selenium.remote.RemoteWebDriver.startSession(RemoteWebDriver.java:245) ~[selenium-remote-driver-4.25.0.jar!/:na]
at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:174) ~[selenium-remote-driver-4.25.0.jar!/:na]
at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:152) ~[selenium-remote-driver-4.25.0.jar!/:na]
at com.itstime.xpact.global.crawler.CrawlerUtil.getWebDriver(CrawlerUtil.java:36) ~[!/:0.0.1-SNAPSHOT]
... 32 common frames omitted
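Both traces above fail while the session is being created (ProtocolHandshake.createSession). One way to tell "the Grid container is not reachable or not ready" apart from "Chrome itself failed to start" is to poll the Grid's /status endpoint before creating a session. The sketch below was not part of the original setup; the helper names are hypothetical, and the URL http://selenium:4444/status is an assumption derived from the compose service name and port above.

```python
import json
import time
import urllib.request
from typing import Callable


def is_grid_ready(status_body: str) -> bool:
    """Return True if a Grid /status response body reports value.ready == true."""
    try:
        payload = json.loads(status_body)
    except json.JSONDecodeError:
        return False
    value = payload.get("value") if isinstance(payload, dict) else None
    return isinstance(value, dict) and value.get("ready") is True


def wait_for_grid(fetch_status: Callable[[], str], attempts: int = 10, delay_seconds: float = 1.0) -> bool:
    """Call fetch_status up to `attempts` times until the Grid reports ready."""
    for attempt in range(attempts):
        try:
            if is_grid_ready(fetch_status()):
                return True
        except OSError:
            # Connection refused or timed out: the Grid is not reachable yet
            pass
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False


def fetch_status_http(url: str = "http://selenium:4444/status") -> str:
    """Fetch the status body over HTTP (assumed URL; adjust to your network setup)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")
```

Calling wait_for_grid(fetch_status_http) returns False when the Grid never becomes reachable, which points at networking. If it returns True but session creation still fails, as in the DevToolsActivePort error, the problem is more likely inside the browser container.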
A Dockerfile makes it easy to set up the runtime environment, so I decided to use one to build the scraper as a Lambda function.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import json


def lambda_handler(event, context):
    chrome_options = Options()
    # Chrome and chromedriver are placed under /opt by the Dockerfile
    chrome_options.binary_location = "/opt/chrome/chrome"
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--single-process")

    service = Service(executable_path="/opt/chromedriver")
    driver = webdriver.Chrome(
        service=service,
        options=chrome_options
    )

    try:
        driver.get("https://www.jobkorea.co.kr/recruit/joblist?menucode=duty")
        recruits = driver.find_elements(By.CSS_SELECTOR, ".job.circleType.dev-tab.dev-duty .nano-content.dev-main .item")

        results = []
        for recruit in recruits:
            json_str = recruit.get_attribute("data-value-json")
            if json_str:
                try:
                    data = json.loads(json_str)
                    results.append(data)
                except json.JSONDecodeError:
                    # Skip items whose attribute is not valid JSON
                    pass
    finally:
        # Close the browser even if scraping fails
        driver.quit()

    return {
        'statusCode': 200,
        'body': results
    }
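The JSON handling in the handler can also be pulled into a small pure function so that it can be checked without starting Chrome. This is only a sketch: the name parse_recruit_attributes and the sample values are hypothetical, and it mirrors the loop above by skipping empty values and invalid JSON.

```python
import json
from typing import Any, Iterable, List, Optional


def parse_recruit_attributes(raw_values: Iterable[Optional[str]]) -> List[Any]:
    """Parse data-value-json attribute strings, skipping empty or invalid entries."""
    results = []
    for raw in raw_values:
        if not raw:
            # Attribute missing (None) or empty string
            continue
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError:
            # Same behavior as the handler: ignore malformed JSON
            continue
    return results


# Hypothetical attribute values for illustration; real field names may differ.
parsed = parse_recruit_attributes(['{"id": 1}', None, "", "{broken", '{"id": 2}'])
```

Inside the handler this could be called as parse_recruit_attributes(r.get_attribute("data-value-json") for r in recruits).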
The lambda_handler function in main.py is the entry point of the Lambda function.

FROM --platform=linux/amd64 public.ecr.aws/lambda/python:3.13 AS build
# Install unzip, then download Chrome and chromedriver and unpack them under /opt
RUN dnf install -y unzip && \
    curl -Lo "/tmp/chromedriver-linux64.zip" "https://storage.googleapis.com/chrome-for-testing-public/136.0.7103.92/linux64/chromedriver-linux64.zip" && \
    curl -Lo "/tmp/chrome-linux64.zip" "https://storage.googleapis.com/chrome-for-testing-public/136.0.7103.92/linux64/chrome-linux64.zip" && \
    unzip /tmp/chromedriver-linux64.zip -d /opt/ && \
    unzip /tmp/chrome-linux64.zip -d /opt/
# Final stage (assumed: the COPY --from=build lines below need a stage separate from "build")
FROM --platform=linux/amd64 public.ecr.aws/lambda/python:3.13
# Install the packages Chrome needs at runtime
RUN dnf install -y atk cups-libs gtk3 libXcomposite alsa-lib libXcursor libXdamage libXext libXi libXrandr libXScrnSaver libXtst pango at-spi2-atk libXt xorg-x11-server-Xvfb xorg-x11-xauth dbus-glib dbus-glib-devel nss mesa-libgbm
# Install selenium so the Python handler can drive Chrome
RUN pip install selenium
COPY --from=build /opt/chrome-linux64 /opt/chrome
COPY --from=build /opt/chromedriver-linux64 /opt/
# Copy the handler and set it as the Lambda entry point
COPY main.py ./
CMD [ "main.lambda_handler" ]
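The Dockerfile installs the WebDriver and Chrome and sets the handler in main.py as the command to run.
Build the image with:

docker build -t {repository_name} .

Next, to upload the image to AWS, a repository has to be created in ECR (Elastic Container Registry).
After the repository is created, "View push commands" shows the commands for pushing the image to that repository.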
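The console output itself is not reproduced here, but the push commands usually take the shape sketched below. The function name ecr_push_commands is hypothetical, and the account ID, region, and repository name in the example call are placeholders, not values from this project.

```python
from typing import List


def ecr_push_commands(account_id: str, region: str, repository: str, tag: str = "latest") -> List[str]:
    """Build the login, build, tag, and push commands for an ECR repository."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    image_uri = f"{registry}/{repository}:{tag}"
    return [
        # Authenticate the Docker client against the ECR registry
        f"aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {registry}",
        # Build the image locally, then tag it with the registry URI
        f"docker build -t {repository}:{tag} .",
        f"docker tag {repository}:{tag} {image_uri}",
        # Upload the tagged image to ECR
        f"docker push {image_uri}",
    ]


# Placeholder values for illustration only
commands = ecr_push_commands("123456789012", "ap-northeast-2", "recruit-scraper")
```

Printing each entry of commands gives lines that can be compared against what the console shows for your own repository.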
After the push completes, the image shows up in the repository.