while scrolling through youtube, i came across this video on my feed by javidx9 - https://youtu.be/tgamhuQnOkM?si=QnRopBay40F54sAH - about building a sound synthesizer in c++, so i thought of giving audio programming a shot. in this blog post, we'll cover a few of the basic topics and build a simple program which generates a sine wave of a certain frequency and saves it to a wave file, amplifies a given wave file and performs stereopanning on it.
sound is a phenomenon caused by vibration in particles that propagates as a wave through a transmission medium such as air, water, or solids. most of the sounds in the real world propagate in the form of a sine wave (or combination of different sine waves).
for example, the note A above middle C on the piano propagates as an (almost) pure sine wave with a frequency of 440 Hz (ref: how can a piano key only have one frequency?) and, with x in seconds, it can be mathematically represented as follows:
y = sin(2π · 440 · x) = sin(880πx)
sound is mechanical wave energy, while audio is the electrical representation of that sound wave.
a microphone converts the mechanical sound waves into an analog signal, which is then passed through an analog-to-digital converter (ADC) that turns it into a digital signal a computer can understand.
there are two important keywords in the context of digital audio - bit depth and sample frequency.
the analog signal holds information about various wave characteristics at a particular instant, such as the amplitude. sample frequency (or sample rate) is the number of times per second a snapshot of these characteristics is taken; these snapshots are later used to re-create the sound wave.
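to make "snapshots" concrete, here's a minimal, self-contained go sketch - the 1 Hz signal and the deliberately tiny sample rate are made up purely so the output stays readable:

package main

import (
	"fmt"
	"math"
)

func main() {
	const sampleRate = 8 // snapshots per second - tiny on purpose
	signal := func(t float64) float64 {
		return math.Sin(2 * math.Pi * 1 * t) // a 1 Hz sine wave
	}
	for i := 0; i < sampleRate; i++ {
		t := float64(i) / sampleRate // time at which the i-th snapshot is taken
		fmt.Printf("t=%.3fs  amplitude=%+.3f\n", t, signal(t))
	}
}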
most of the audio delivered nowadays uses either 44.1 kHz or 48 kHz as the sample frequency. human hearing ranges roughly from 20 Hz to 20 kHz. the nyquist rate is the minimum sampling rate needed to accurately represent a signal, and it is twice the highest frequency in the signal:
2 * 20000 = 40000
why is it 44.1 kHz and not just 40 kHz? well, the extra ~4 kHz sorta acts like a transition band or room for error for the anti-aliasing filter, which prevents unwanted distortion in the higher frequencies.
in the early days, digital audio was stored on modified video recorders and 44.1 kHz worked perfectly with the video equipment at that time and it became the industry standard.
bit depth is related to the precision of each snapshot. if the bit depth is 16, each snapshot is stored as a signed 16-bit integer, so the maximum value which can be represented is +32767 (2^15 - 1) and the least (symmetrically) is -32767 - a snapshot can have either +ve or -ve amplitude. so in simple words - the higher the bit depth, the finer the amplitude resolution of each snapshot and the lower the quantization noise, which is most noticeable in really quiet passages.
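as a tiny illustration of what 16-bit depth means (quantize16 is a made-up helper name for this example), here's a float sample in [-1, 1] being mapped onto the signed 16-bit range:

package main

import "fmt"

// quantize16 maps a float sample in [-1, 1] onto the signed 16-bit range.
func quantize16(s float64) int16 {
	return int16(s * 32767) // 2^15 - 1, the largest positive 16-bit value
}

func main() {
	for _, s := range []float64{-1, -0.5, 0, 0.25, 1} {
		fmt.Printf("%+.2f -> %+d\n", s, quantize16(s))
	}
}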
let's write a simple program which generates a sine wave of 440 Hz.
package main

import (
	"encoding/binary"
	"fmt"
	"math"
	"os"
	"time"
)

const (
	duration   = 5     // seconds
	sampleRate = 44100 // samples per second
	freq       = 440   // Hz
)

func main() {
	ns := duration * sampleRate
	// angular increment per sample for a 1 Hz wave;
	// multiplied by freq below to get the phase advance of a 440 Hz wave
	angle := (math.Pi * 2.0) / float64(sampleRate)
	f, err := os.Create("wave.bin")
	if err != nil {
		panic(err.Error())
	}
	start := time.Now()
	for i := 0; i < ns; i++ {
		sample := math.Sin(angle * freq * float64(i))
		var buf [4]byte
		binary.LittleEndian.PutUint32(buf[:], math.Float32bits(float32(sample)))
		if _, err := f.Write(buf[:]); err != nil {
			panic(err.Error())
		}
	}
	fmt.Printf("done - %dms\n", time.Since(start).Milliseconds())
}
the above program generates a .bin file containing the binary representation of the audio samples of
y = sin(880πx)
taken at a sample rate of 44.1 kHz.
as the sample rate is the number of samples taken per second, the number of samples can be found out with the help of the duration and the sample rate:
number of samples = (sample rate) * (duration)
angle is the angular increment per sample for a 1 Hz wave; multiplying it by freq and the sample index gives the phase of the 440 Hz wave at that sample.
sample is the snapshot of the wave characteristics at that moment - in this case it is the amplitude, i.e. the value of the function at that point.
the sample (which is a floating point number) is converted to its
corresponding little-endian byte representation and then written to
wave.bin
. i'm converting it to little-endian as my CPU (intel
i5) uses little-endian. check which byte representation your machine's CPU
follows via the following command:
lscpu | grep "Byte Order"
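if you'd rather check from go itself, a quick (and admittedly hacky) sketch using unsafe does the job too:

package main

import (
	"fmt"
	"unsafe"
)

func main() {
	x := uint16(1)
	// look at the first byte of the two-byte value 1 in memory:
	// 0x01 means the low byte comes first, i.e. little-endian
	if *(*byte)(unsafe.Pointer(&x)) == 1 {
		fmt.Println("little-endian")
	} else {
		fmt.Println("big-endian")
	}
}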
to play the audio, we can use ffplay
ffplay -f f32le -ar 44100 -showmode 1 wave.bin
-f specifies the input format. f32le indicates that the audio is raw 32-bit float samples in little-endian byte order.
-ar
specifies the audio sample rate, which is 44.1 kHz in
this case.
-showmode 1
opens a GUI showing the sine wave re-created
from the samples.
on running the above ffplay
command, you should hear a sound
something similar to -
rec.mp4
right now, the audio ends abruptly. let's fix that by adding exponential decay. exponential decay gradually decreases the amplitude on every sample, which leads to a neat fade-away sorta effect. the decay factor is picked so that after ns multiplications the amplitude drops from startAmplitude (1.0) to endAmplitude (1e-4): decayFactor = (endAmplitude / startAmplitude)^(1/ns).
startAmplitude := 1.0
endAmplitude := 1.0e-4
decayFactor := math.Pow(endAmplitude/startAmplitude, 1.0/float64(ns))
// ...
for i := 0; i < ns; i++ {
	sample := math.Sin(angle * freq * float64(i))
	sample *= startAmplitude
	startAmplitude *= decayFactor
	var buf [4]byte
	binary.LittleEndian.PutUint32(buf[:], math.Float32bits(float32(sample)))
	if _, err := f.Write(buf[:]); err != nil {
		panic(err.Error())
	}
}
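to sanity-check the decay maths (same constants as above), the amplitude after ns multiplications should land back on endAmplitude:

package main

import (
	"fmt"
	"math"
)

func main() {
	ns := 5 * 44100
	start, end := 1.0, 1.0e-4
	decayFactor := math.Pow(end/start, 1.0/float64(ns))
	// start * decayFactor^ns should be ≈ end
	fmt.Println(start * math.Pow(decayFactor, float64(ns))) // ≈ 1e-4
}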
on running the script, you should notice that the audio fades off at the end - rec.mp4
now that we have generated raw sample bytes that can produce some sound, let's save them into a wave file, so it can be played by regular media players rather than only through ffplay.
waveform audio file format or wave in short stores audio data as samples, along with some metadata such as number of audio channels (mono, stereo, etc.). a wave file is usually encoded using pulse code modulation (although, it isn't required to fully understand pulse code modulation to implement this blog post by yourself).
a wave file follows a strict format and is majorly split into three blocks of data.
the structure of the header is as follows (on the left side, the byte offsets are mentioned and on the right side, the corresponding data's label):
bytes 0-3 - chunk id (the ASCII letters RIFF. the rest of the file is written in little-endian; if it were written in big-endian, this would have been RIFX)
bytes 4-7 - chunk size (size of the rest of the file in bytes, i.e. everything after this field)
bytes 8-11 - format (the ASCII letters WAVE)
the structure of the "fmt" block is as follows:
bytes 12-15 - sub chunk 1 id (the ASCII letters fmt )
bytes 16-19 - sub chunk 1 size (16 for pulse code modulation)
bytes 20-21 - audio format (1, if it is encoded via pulse code modulation)
bytes 22-23 - number of channels
bytes 24-27 - sample rate
bytes 28-31 - byte rate (sample rate * number of channels * bits per sample / 8)
bytes 32-33 - block align (number of channels * bits per sample / 8)
bytes 34-35 - bits per sample
the structure of the data block is as follows:
bytes 36-39 - sub chunk 2 id (the ASCII letters data)
bytes 40-43 - sub chunk 2 size (number of samples * number of channels * bits per sample / 8)
bytes 44 onwards - the actual sample data
here is a better pictorial representation of the structure - https://ccrma.stanford.edu/courses/422/projects/WaveFormat/. if you wanna go a bit deeper, you can play around by opening a wave file in hex editors like ImHex.
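if you just want a quick peek at the 44-byte header from the terminal, xxd works as well (test.wav here is the file we generate later in this post):
xxd -l 44 test.wav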
let's first code out some structs adhering to the above-mentioned format.
package types
type Sample float64
type WaveHeader struct {
ChunkId []byte
ChunkSize int
}
type WaveFmt struct {
SubChunk1Id []byte
SubChunk1Size int
AudioFormat int
NumOfChannels int
SampleRate int
ByteRate int
BlockAlign int
BitsPerSample int
}
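the wavefmt value that gets passed to the writer later in this post isn't shown explicitly, so here's a plausible way it could be constructed for 44.1 kHz, mono, 16-bit PCM - treat the exact values as my assumption, not the post's:

wavefmt := types.WaveFmt{
	SubChunk1Id:   []byte("fmt "), // presumably the same value constants.WaveSubChunk1Id holds
	SubChunk1Size: 16,             // 16 for pulse code modulation
	AudioFormat:   1,              // 1 = pulse code modulation
	NumOfChannels: 1,
	SampleRate:    44100,
	ByteRate:      44100 * 1 * 16 / 8, // sample rate * channels * bits per sample / 8
	BlockAlign:    1 * 16 / 8,         // channels * bits per sample / 8
	BitsPerSample: 16,
}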
before coding out an implementation for WaveWriter
, let's
make a few utility functions which would convert ints/floats to their
little-endian byte representations.
func IntToBits(i int, size int) []byte {
switch size {
case 16:
return Int16ToBits(i)
case 32:
return Int32ToBits(i)
default:
panic("invalid size. only 16 and 32 bits are accepted")
}
}
func Int16ToBits(i int) []byte {
b := make([]byte, 2)
binary.LittleEndian.PutUint16(b, uint16(i))
return b
}
func Int32ToBits(i int) []byte {
b := make([]byte, 4)
binary.LittleEndian.PutUint32(b, uint32(i))
return b
}
func FloatToBits(f float64, size int) []byte {
	// truncating a float64's byte representation doesn't yield a valid smaller
	// float, so each size gets its own proper encoding
	switch size {
	case 4:
		b := make([]byte, 4)
		binary.LittleEndian.PutUint32(b, math.Float32bits(float32(f)))
		return b
	default:
		b := make([]byte, 8)
		binary.LittleEndian.PutUint64(b, math.Float64bits(f))
		return b
	}
}
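a quick usage example - converting the sample rate 44100 (0xAC44 in hex) into its 4-byte little-endian form, assuming these helpers live in the same package:

fmt.Printf("% x\n", Int32ToBits(44100)) // prints: 44 ac 00 00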
and a few more utility functions which convert the WaveFmt block, the samples and the header into their equivalent little-endian byte representations.
func WaveFmtToBits(wfmt types.WaveFmt) []byte {
var b []byte
b = append(b, wfmt.SubChunk1Id...)
b = append(b, Int32ToBits(wfmt.SubChunk1Size)...)
b = append(b, Int16ToBits(wfmt.AudioFormat)...)
b = append(b, Int16ToBits(wfmt.NumOfChannels)...)
b = append(b, Int32ToBits(wfmt.SampleRate)...)
b = append(b, Int32ToBits(wfmt.ByteRate)...)
b = append(b, Int16ToBits(wfmt.BlockAlign)...)
b = append(b, Int16ToBits(wfmt.BitsPerSample)...)
return b
}
func SamplesToBits(samples []types.Sample, wfmt types.WaveFmt) ([]byte, error) {
var b []byte
for _, s := range samples {
var multiplier int
switch wfmt.BitsPerSample {
case 8:
multiplier = math.MaxInt8
case 16:
multiplier = math.MaxInt16
case 32:
multiplier = math.MaxInt32
case 64:
multiplier = math.MaxInt64
default:
return nil, fmt.Errorf("invalid size - %d, must be 8, 16, 32 or 64-bits only", wfmt.BitsPerSample)
}
bits := IntToBits(int(float64(s)*float64(multiplier)), wfmt.BitsPerSample)
b = append(b, bits...)
}
return b, nil
}
func CreateHeaderBits(samples []types.Sample, wfmt types.WaveFmt) []byte {
var b []byte
chunkSizeInBits := Int32ToBits(36 + (len(samples)*wfmt.NumOfChannels*wfmt.BitsPerSample)/8)
b = append(b, []byte(constants.WaveChunkId)...)
b = append(b, chunkSizeInBits...)
b = append(b, []byte(constants.WaveFileFormat)...)
return b
}
now that we have all the utility functions set up, let's write a simple struct WaveWriter which implements a method WriteWaveFile that saves the sample data into a wave file.
type WaveWriter struct{}
func NewWaveWriter() WaveWriter {
return WaveWriter{}
}
func (w WaveWriter) WriteWaveFile(file string, samples []types.Sample, metadata types.WaveFmt) error {
f, err := os.Create(file)
if err != nil {
return err
}
defer f.Close()
var data []byte
headerBits := utils.CreateHeaderBits(samples, metadata)
data = append(data, headerBits...)
wfmtInBits := utils.WaveFmtToBits(metadata)
data = append(data, wfmtInBits...)
data = append(data, []byte(constants.WaveSubChunk2Id)...)
data = append(data, utils.Int32ToBits(len(samples)*metadata.NumOfChannels*metadata.BitsPerSample/8)...)
samplesBits, err := utils.SamplesToBits(samples, metadata)
if err != nil {
return err
}
data = append(data, samplesBits...)
if _, err := f.Write(data); err != nil {
return err
}
return nil
}
the header is created using the CreateHeaderBits utility function, and the fmt block and samples are converted to their equivalent byte representations using WaveFmtToBits and SamplesToBits.
let's use the WriteWaveFile
method in our script to save the
samples into a wave
file.
var samples []types.Sample
for i := 0; i < ns; i++ {
sample := types.Sample(math.Sin(angle*freq*float64(i)) * startAmplitude)
startAmplitude *= decayFactor
samples = append(samples, sample)
}
waveWriter := helpers.NewWaveWriter()
if err := waveWriter.WriteWaveFile("test.wav", samples, wavefmt); err != nil {
panic(err.Error())
}
on running the script, a new file named test.wav should be created, which should sound similar to -
test.wav
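you can also sanity-check the header that WriteWaveFile produced straight from the terminal - ffprobe (which ships alongside ffmpeg/ffplay) prints the format it parses out, including the sample rate and the number of channels:
ffprobe test.wav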
we've successfully implemented the first part of the blog post which is to
generate a sine wave of a constant frequency and save it to a wave file.
the next part is to amplify a given wave file. to amplify a given input
wave file, we have to first parse through it. to do so, we have to
implement
WaveReader
.
before actually implementing WaveReader, we have to add a few utility functions which convert bits back to ints/floats.
func BitsToInt(b []byte, size int) int {
switch size {
case 16:
return Bits16ToInt(b)
case 32:
return Bits32ToInt(b)
default:
panic("invalid size. only 16 and 32 bits are accepted")
}
}
func Bits16ToInt(b []byte) int {
if len(b) != 2 {
panic(fmt.Errorf("invalid size. expected 2, got %d", len(b)))
}
var payload int16
buf := bytes.NewReader(b)
if err := binary.Read(buf, binary.LittleEndian, &payload); err != nil {
panic(err.Error())
}
return int(payload)
}
func Bits32ToInt(b []byte) int {
if len(b) != 4 {
panic(fmt.Errorf("invalid size. expected 4, got %d", len(b)))
}
var payload int32
buf := bytes.NewReader(b)
if err := binary.Read(buf, binary.LittleEndian, &payload); err != nil {
panic(err.Error())
}
return int(payload)
}
func BitsToFloat(b []byte) float64 {
switch len(b) {
case 4:
bits32 := binary.LittleEndian.Uint32(b)
return float64(math.Float32frombits(bits32))
case 8:
bits64 := binary.LittleEndian.Uint64(b)
return math.Float64frombits(bits64)
default:
panic(fmt.Errorf("invalid size: %d, must be 32 or 64 bits", len(b)*8))
}
}
the WaveReader struct implements a few methods which parse through the input wave file and return the parsed header (WaveHeader), fmt block (WaveFmt) and samples ([]Sample). while parsing, the samples are scaled back down: during writing they were multiplied by the maximum value of the corresponding bit depth, so they must be divided by the same value while parsing to maintain consistency.
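here's a tiny, self-contained illustration of that round trip for the 16-bit case (the numbers are made up for the example):

package main

import "fmt"

func main() {
	original := 0.5
	stored := int(original * 32767)        // roughly what SamplesToBits writes out
	recovered := float64(stored) / 32767.0 // what the reader scales it back to
	fmt.Println(stored, recovered)         // 16383 0.4999847...
}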
type WaveReader struct{}
func NewWaveReader() WaveReader {
return WaveReader{}
}
func (r WaveReader) ParseFile(file string) (types.Wave, error) {
f, err := os.Open(file)
if err != nil {
return types.Wave{}, err
}
defer f.Close()
data, err := io.ReadAll(f)
if err != nil {
return types.Wave{}, err
}
header, err := r.parseHeader(data)
if err != nil {
return types.Wave{}, err
}
wavefmt, err := r.parseMetadata(data)
if err != nil {
return types.Wave{}, err
}
samples, err := r.parseData(data)
if err != nil {
return types.Wave{}, err
}
wave := types.Wave{
WaveHeader: header,
WaveFmt: wavefmt,
Samples: samples,
}
return wave, nil
}
func (r WaveReader) parseHeader(data []byte) (types.WaveHeader, error) {
header := types.WaveHeader{}
chunkId := data[0:4]
if string(chunkId) != constants.WaveChunkId {
return header, errors.New("invalid file")
}
header.ChunkId = chunkId
chunkSize := data[4:8]
header.ChunkSize = utils.Bits32ToInt(chunkSize)
format := data[8:12]
if string(format) != constants.WaveFileFormat {
return header, errors.New("invalid format")
}
return header, nil
}
func (r WaveReader) parseMetadata(data []byte) (types.WaveFmt, error) {
metadata := types.WaveFmt{}
subChunk1Id := data[12:16]
if string(subChunk1Id) != constants.WaveSubChunk1Id {
return metadata, fmt.Errorf("invalid sub chunk 1 id - %s", string(subChunk1Id))
}
metadata.SubChunk1Id = subChunk1Id
metadata.SubChunk1Size = utils.Bits32ToInt(data[16:20])
metadata.AudioFormat = utils.Bits16ToInt(data[20:22])
metadata.NumOfChannels = utils.Bits16ToInt(data[22:24])
metadata.SampleRate = utils.Bits32ToInt(data[24:28])
metadata.ByteRate = utils.Bits32ToInt(data[28:32])
metadata.BlockAlign = utils.Bits16ToInt(data[32:34])
metadata.BitsPerSample = utils.Bits16ToInt(data[34:36])
return metadata, nil
}
func (r WaveReader) parseData(data []byte) ([]types.Sample, error) {
metadata, err := r.parseMetadata(data)
if err != nil {
return nil, err
}
subChunk2Id := data[36:40]
if string(subChunk2Id) != constants.WaveSubChunk2Id {
return nil, fmt.Errorf("invalid sub chunk 2 id - %s", string(subChunk2Id))
}
bytesPerSampleSize := metadata.BitsPerSample / 8
rawData := data[44:]
samples := []types.Sample{}
for i := 0; i < len(rawData); i += bytesPerSampleSize {
rawSample := rawData[i : i+bytesPerSampleSize]
unscaledSample := utils.BitsToInt(rawSample, metadata.BitsPerSample)
scaledSample := types.Sample(float64(unscaledSample) / float64(utils.MaxValue(metadata.BitsPerSample)))
samples = append(samples, scaledSample)
}
return samples, nil
}
as you might have noticed, i have added a new type Wave which just contains the wave header, fmt block and samples - it makes it easier to access and return the data.
type Wave struct {
WaveHeader
WaveFmt
Samples []Sample
}
in simple words, amplification is nothing more than scaling up/down the amplitude of the wave at a given point. so basically, messing around with each individual sample.
waveReader := helpers.NewWaveReader()
waveWriter := helpers.NewWaveWriter()
wave, err := waveReader.ParseFile(input)
if err != nil {
return err
}
var updatedSamples []types.Sample
for _, sample := range wave.Samples {
updatedSample := types.Sample(float64(sample) * scaleFactor)
updatedSamples = append(updatedSamples, updatedSample)
}
if err := waveWriter.WriteWaveFile(output, updatedSamples, wave.WaveFmt); err != nil {
return err
}
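one thing to watch out for (it isn't handled in the snippet above): scaling samples up can push them outside [-1, 1], which would overflow the integer conversion inside SamplesToBits. a small guard you could add - purely my addition, not part of the original code - is to clamp each sample before writing:

// clamp keeps a scaled sample within [-1, 1] so the int conversion can't overflow.
// hypothetical helper, not part of the post's code.
func clamp(s float64) float64 {
	if s > 1 {
		return 1
	}
	if s < -1 {
		return -1
	}
	return s
}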
[out-of-context] i have re-structured the entire program into a CLI using cobra. generating the constant-frequency sine wave is handled by the generate command and the amplification part is handled by the amplify command. it is not entirely necessary to use cobra, it was more of a personal choice.
this is the output when a 440Hz pure sine wave is scaled down by 0.2 times - output.wav
in the context of audio programming, stereopanning refers to positioning the sound within the space, allowing you to make it appear as if it's coming from the left speaker, right speaker, or anywhere in between by adjusting the audio signal.
until now, we have only worked with a single audio channel. in the case of multiple audio channels, a single float number doesn't make up a sample by itself - rather, one float number per channel together makes up a sample (often called a frame).
[f1][f2][f3][f4]...[fn]; where fn is the nth float number
if the number of audio channels is set to be 1, then f1 alone makes up for the 1st sample. whereas, if the number of audio channels is set to be 2, then f1 and f2 combined make up for the 1st sample and f3 and f4 combined make up for the 2nd sample.
for the sake of simplicity, we will perform stereopanning on a mono audio file (i.e. number of audio channels = 1) and return a stereo audio file (i.e. number of audio channels = 2).
generally, the position of audio in the space is represented using a number within the range of [-1, 1].
consider p to be the panning position in the range [-1, 1]. the multiplying factor for each of the two audio channels can then be found by transforming that range.
for the left channel, [-1, 1] is transformed to [-0.5, 0.5] by dividing by 2, and then 0.5 is subtracted to get [-1, 0]:
p → p/2 → (p/2) - 0.5
for the right channel, [-1, 1] is transformed to [-0.5, 0.5] and then 0.5 is added to get [0, 1]:
p → p/2 → (p/2) + 0.5
(note that the left multiplier comes out negative; only its magnitude decides the loudness, the sign just flips the polarity of the signal.)
in code, it can be expressed as follows
func PanPositionToChanMultipliers(p float64) (float64, float64) {
if !(p >= -1 && p <= 1) {
panic("pan position outside [-1, 1] range")
}
leftChanMultiplier := (p / 2) - 0.5
rightChanMultiplier := (p / 2) + 0.5
return leftChanMultiplier, rightChanMultiplier
}
ik, not so great function name
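plugging a few positions into it makes the behaviour concrete (the printed values assume the function above):

for _, p := range []float64{-1, 0, 1} {
	l, r := utils.PanPositionToChanMultipliers(p)
	fmt.Printf("p=%+.1f left=%+.1f right=%+.1f\n", p, l, r)
}
// p=-1.0 left=-1.0 right=+0.0
// p=+0.0 left=-0.5 right=+0.5
// p=+1.0 left=+0.0 right=+1.0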
and after figuring out the multipliers for the left and right channels, it is pretty much the same as amplifying - just multiplying each sample by the corresponding factor and writing that data to the wave file.
wave, err := waveReader.ParseFile(input)
if err != nil {
return err
}
leftChanMultiplier, rightChanMultiplier := utils.PanPositionToChanMultipliers(panningPosition)
var updatedSamples []types.Sample
for _, sample := range wave.Samples {
updatedSamples = append(updatedSamples, types.Sample(sample.ToFloat()*leftChanMultiplier))
updatedSamples = append(updatedSamples, types.Sample(sample.ToFloat()*rightChanMultiplier))
}
wave.WaveFmt.NumOfChannels = 2
if err := waveWriter.WriteWaveFile(output, updatedSamples, wave.WaveFmt); err != nil {
return err
}
i ran the script on rec.wav and here is the output when the panning position is equal to -1 - left.wav and here is the output when it is equal to 1 - right.wav
you can clearly notice that when you play left.wav
, the audio
just comes from the left speaker and when you play right.wav
,
it just comes from the right speaker.
...and well, that is pretty much it for this blog post. i might write a few more posts about this topic, covering things such as wavetables and ADSR.
got any thoughts related to this blog post? drop 'em over here - discussion
source code