Implementing TTS (Text To Speech)

camel · April 13, 2023

I implemented TTS using only the Web Speech API, a web-standard API, with no external libraries.

The Web Speech API is split into two main parts: SpeechSynthesis (Text-to-Speech) and SpeechRecognition (Asynchronous Speech Recognition).
SpeechSynthesis converts text into speech, and SpeechRecognition converts speech into text.

The API we need here is SpeechSynthesis.
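
Before calling the API, it can help to confirm that the browser exposes it. The following is a minimal sketch, not part of the original implementation: supportsSpeechSynthesis is a hypothetical helper name, and it takes the global object as a parameter so it can also be exercised outside a browser.

```typescript
// Hypothetical helper (not part of the Web Speech API): reports whether the
// given global object exposes the two entry points used for text-to-speech.
function supportsSpeechSynthesis(globalObject: object): boolean {
  return (
    "speechSynthesis" in globalObject &&
    "SpeechSynthesisUtterance" in globalObject
  );
}

// Stand-in objects for demonstration; in a browser one would pass `window`.
const fakeBrowser = { speechSynthesis: {}, SpeechSynthesisUtterance: class {} };
console.log(supportsSpeechSynthesis(fakeBrowser)); // true
console.log(supportsSpeechSynthesis({})); // false
```

In a page, a call such as supportsSpeechSynthesis(window) could decide whether to show the "read aloud" button at all.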

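// getVoices() returns the list synchronously, so no await is needed on it.
// The function stays async so existing callers can keep awaiting it.
async function populateVoiceList(synth: SpeechSynthesis) {
    try {
      // Copy the array before sorting so the original list is not mutated.
      const voices = [...synth.getVoices()].sort(function (a, b) {
        const aname = a.name.toUpperCase();
        const bname = b.name.toUpperCase();
        if (aname < bname) return -1;
        else if (aname === bname) return 0;
        else return +1;
      });

      return voices;
    } catch (error) {
      throw new Error("Failure retrieving voices");
    }
}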

First, the synth parameter of populateVoiceList is a SpeechSynthesis object. This object controls the browser's speech synthesis API, and text is synthesized into speech through it.

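populateVoiceList calls the getVoices method of the synth object to get the voices available on the system and returns them sorted alphabetically by name. The sorted list is then available to the speak function. In the logic I referenced, populateVoiceList gathers the information speak needs and is meant to run before speech synthesis is used. Note that getVoices may return an empty list until the browser has finished loading its voices.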
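
To make the sorting and a possible use of the sorted list concrete, here is a small sketch under stated assumptions. VoiceInfo is a hypothetical stand-in for the name and lang fields of SpeechSynthesisVoice, findVoiceByLang is a hypothetical helper, and the sample voices are made-up data.

```typescript
// Hypothetical stand-in for the two SpeechSynthesisVoice fields used here.
interface VoiceInfo {
  name: string;
  lang: string; // language tag such as "ko-KR" or "en-US"
}

// Returns a new array sorted by name, ignoring case (the same ordering rule
// as populateVoiceList). The input array is left untouched.
function sortVoicesByName<T extends VoiceInfo>(voices: readonly T[]): T[] {
  return [...voices].sort((a, b) => {
    const an = a.name.toUpperCase();
    const bn = b.name.toUpperCase();
    return an < bn ? -1 : an > bn ? 1 : 0;
  });
}

// Hypothetical helper: first voice whose language tag starts with the prefix.
function findVoiceByLang<T extends VoiceInfo>(
  voices: readonly T[],
  langPrefix: string
): T | undefined {
  const wanted = langPrefix.toLowerCase();
  return voices.find((v) => v.lang.toLowerCase().startsWith(wanted));
}

// Made-up sample data for demonstration.
const sampleVoices: VoiceInfo[] = [
  { name: "Yuna", lang: "ko-KR" },
  { name: "alex", lang: "en-US" },
  { name: "Samantha", lang: "en-US" },
];
const sortedVoices = sortVoicesByName(sampleVoices);
console.log(sortedVoices.map((v) => v.name)); // [ 'alex', 'Samantha', 'Yuna' ]
console.log(findVoiceByLang(sortedVoices, "ko")?.name); // Yuna
```

In a browser, the voice found this way could be assigned to the voice property of a SpeechSynthesisUtterance before speaking.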

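This function was in the logic I referenced, so I worked out what it was for,
but when I actually checked, it made no difference. synth never actually received a value, either.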

const pitch = 1;
const rate = 1;

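export async function speak(textToRead: string, synth: SpeechSynthesis) {
    // Refresh the voice list whenever the browser reports a change.
    // The handler has to call populateVoiceList(synth); returning the
    // function without calling it does nothing.
    if (synth.onvoiceschanged !== undefined) {
      synth.onvoiceschanged = () => {
        populateVoiceList(synth)
      }
    }

    // Do not start a new utterance while one is still being spoken.
    if (synth.speaking) {
      console.error("speechSynthesis.speaking")
      return
    }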

synth receives window.speechSynthesis, and textToRead receives a string.

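The voiceschanged event, handled through speechSynthesis.onvoiceschanged, fires whenever the list of voices available to the speech synthesis API changes. In other words, it fires when a voice is added to the system or an existing voice is removed, and also when the browser finishes loading its voice list asynchronously.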

When a function is assigned to the speechSynthesis.onvoiceschanged property, that function runs every time the event fires.
This way, whenever the voices installed on the system change, the latest list can be fetched and used.
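
One way to combine getVoices with onvoiceschanged is to wrap them in a Promise that resolves once a non-empty list is available. This is only a sketch: waitForVoices and the VoiceSource interface are hypothetical names, the interface mirrors just the two members of SpeechSynthesis used here, and the 2000 ms timeout is an arbitrary assumption.

```typescript
// Hypothetical stand-in for the parts of SpeechSynthesis used here.
interface VoiceSource<V> {
  getVoices(): V[];
  onvoiceschanged: ((event: any) => any) | null;
}

// Hypothetical helper: resolves with the voice list as soon as it is
// non-empty, or with whatever is available once timeoutMs has elapsed.
function waitForVoices<V>(source: VoiceSource<V>, timeoutMs = 2000): Promise<V[]> {
  return new Promise((resolve) => {
    const current = source.getVoices();
    if (current.length > 0) {
      resolve(current); // voices were already loaded
      return;
    }
    const timer = setTimeout(() => {
      source.onvoiceschanged = null;
      resolve(source.getVoices()); // give up waiting; may still be empty
    }, timeoutMs);
    source.onvoiceschanged = () => {
      const loaded = source.getVoices();
      if (loaded.length === 0) return; // keep waiting for a useful list
      clearTimeout(timer);
      source.onvoiceschanged = null;
      resolve(loaded);
    };
  });
}

// Fake source for demonstration: the list is empty until the event fires.
let fakeVoiceList: string[] = [];
const fakeSource: VoiceSource<string> = {
  getVoices: () => fakeVoiceList,
  onvoiceschanged: null,
};
waitForVoices(fakeSource).then((voices) => console.log(voices)); // [ 'Voice A' ]
fakeVoiceList = ["Voice A"];
fakeSource.onvoiceschanged?.({}); // simulate the browser firing voiceschanged
```

In a browser the intent is to pass window.speechSynthesis as the source. Note that this sketch overwrites any handler already assigned to onvoiceschanged.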

synth.speaking is a Boolean that indicates whether speech is currently in progress. When I checked, it only ever came back false, so the check never triggered.
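
An alternative to returning early when synth.speaking is true is to cancel whatever is queued and speak the new text instead. The sketch below illustrates that policy and is not what the original code does. speakReplacing and SpeechQueue are hypothetical names; the interface mirrors the speaking, pending, cancel and speak members of SpeechSynthesis, and FakeQueue exists only for demonstration.

```typescript
// Hypothetical stand-in for the parts of SpeechSynthesis used here.
interface SpeechQueue<U> {
  readonly speaking: boolean;
  readonly pending: boolean;
  cancel(): void;
  speak(utterance: U): void;
}

// Hypothetical helper: drop anything in progress or queued, then speak.
function speakReplacing<U>(queue: SpeechQueue<U>, utterance: U): void {
  if (queue.speaking || queue.pending) {
    queue.cancel(); // clears the queue, including the current utterance
  }
  queue.speak(utterance);
}

// Fake queue for demonstration: it stores utterances instead of playing them.
class FakeQueue implements SpeechQueue<string> {
  items: string[] = [];
  get speaking(): boolean {
    return this.items.length > 0;
  }
  get pending(): boolean {
    return this.items.length > 1;
  }
  cancel(): void {
    this.items = [];
  }
  speak(utterance: string): void {
    this.items.push(utterance);
  }
}

const demoQueue = new FakeQueue();
speakReplacing(demoQueue, "first");
speakReplacing(demoQueue, "second");
console.log(demoQueue.items); // [ 'second' ]
```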

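    if (textToRead !== "") {
      const utterThis = new SpeechSynthesisUtterance(textToRead)
      // Fires when the utterance has finished being spoken; nothing to do yet.
      utterThis.onend = function (event) {}
      // Log the error code reported by the browser to make failures easier to trace.
      utterThis.onerror = function (event) {
        console.error("SpeechSynthesisUtterance.onerror", event.error)
      }
      utterThis.pitch = pitch
      utterThis.rate = rate
      synth.speak(utterThis)
    }
}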

SpeechSynthesisUtterance is an object that holds the text to synthesize together with the settings for synthesizing it. The constructor SpeechSynthesisUtterance(text) creates this object from the string passed as the text parameter.

pitch and rate are properties that adjust the pitch and the speed of the voice. pitch ranges from 0 to 2 and rate from 0.1 to 10; both default to 1, which is the value used here.
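
If pitch and rate come from user input (a slider, for example), it may be worth clamping them to the ranges the Web Speech API specification defines: 0 to 2 for pitch and 0.1 to 10 for rate. A minimal sketch, where clamp and normalizeVoiceSettings are hypothetical helper names:

```typescript
// Restricts value to the inclusive range [min, max].
function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Ranges defined by the Web Speech API specification.
const PITCH_RANGE = { min: 0, max: 2 } as const;
const RATE_RANGE = { min: 0.1, max: 10 } as const;

// Hypothetical helper: returns pitch and rate values that are safe to assign
// to a SpeechSynthesisUtterance.
function normalizeVoiceSettings(pitch: number, rate: number) {
  return {
    pitch: clamp(pitch, PITCH_RANGE.min, PITCH_RANGE.max),
    rate: clamp(rate, RATE_RANGE.min, RATE_RANGE.max),
  };
}

console.log(normalizeVoiceSettings(1, 1)); // { pitch: 1, rate: 1 }
console.log(normalizeVoiceSettings(3, 0.01)); // { pitch: 2, rate: 0.1 }
```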

The speak method of the SpeechSynthesis object synthesizes a SpeechSynthesisUtterance into speech. So synth.speak(utterThis) adds utterThis to the synth object's queue, and it is played as audio once any earlier utterances have finished.
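
Because speak only queues the utterance and returns immediately, a caller that needs to know when the speech has finished can wrap the onend and onerror handlers in a Promise. The following is a sketch under assumptions: speakAndWait, UtteranceEvents and Speaker are hypothetical names that mirror only the members used here, and the fake speaker exists only for demonstration.

```typescript
// Hypothetical stand-ins for the parts of SpeechSynthesisUtterance and
// SpeechSynthesis used here.
interface UtteranceEvents {
  onend: ((event: any) => any) | null;
  onerror: ((event: any) => any) | null;
}
interface Speaker<U> {
  speak(utterance: U): void;
}

// Hypothetical helper: resolves when the utterance ends, rejects on error.
function speakAndWait<U extends UtteranceEvents>(
  speaker: Speaker<U>,
  utterance: U
): Promise<void> {
  return new Promise((resolve, reject) => {
    utterance.onend = () => resolve();
    utterance.onerror = (event) =>
      reject(new Error(`speech failed: ${String(event?.error ?? "unknown")}`));
    speaker.speak(utterance);
  });
}

// Fake speaker for demonstration: it "finishes" on the next timer tick.
const fakeSpeaker: Speaker<UtteranceEvents> = {
  speak(utterance) {
    setTimeout(() => utterance.onend?.({}), 0);
  },
};
speakAndWait(fakeSpeaker, { onend: null, onerror: null }).then(() =>
  console.log("finished")
); // prints "finished"
```

In a browser the intent is to pass window.speechSynthesis and a SpeechSynthesisUtterance, which would let the caller write await speakAndWait(synth, utterThis) before starting the next step.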

Implementation
https://solo-project-one.vercel.app/video