mir_eval Documentation

mir_eval is a Python library which provides a transparent, standardized, and straightforward way to evaluate Music Information Retrieval systems.

If you use mir_eval in a research project, please cite the following paper:

  C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis, “mir_eval: A Transparent Implementation of Common MIR Metrics”, Proceedings of the 15th International Conference on Music Information Retrieval, 2014.

Installing mir_eval

The simplest way to install mir_eval is by using pip, which will also install the required dependencies if needed. To install mir_eval using pip, simply run

pip install mir_eval

Alternatively, you can install mir_eval from source by first installing the dependencies and then running

python setup.py install

from the source directory.

If you don’t use Python and want to get started as quickly as possible, you might consider using Anaconda which makes it easy to install a Python environment which can run mir_eval.

Using mir_eval

Once you’ve installed mir_eval (see Installing mir_eval), you can import it in your Python code as follows:

import mir_eval

From here, you will typically either load in data and call the evaluate() function from the appropriate submodule like so:

reference_beats = mir_eval.io.load_events('reference_beats.txt')
estimated_beats = mir_eval.io.load_events('estimated_beats.txt')
# Scores will be a dict containing scores for all of the metrics
# implemented in mir_eval.beat.  The keys are metric names
# and values are the scores achieved
scores = mir_eval.beat.evaluate(reference_beats, estimated_beats)

or you’ll load in the data, do some preprocessing, and call specific metric functions from the appropriate submodule like so:

reference_beats = mir_eval.io.load_events('reference_beats.txt')
estimated_beats = mir_eval.io.load_events('estimated_beats.txt')
# Crop out beats before 5s, a common preprocessing step
reference_beats = mir_eval.beat.trim_beats(reference_beats)
estimated_beats = mir_eval.beat.trim_beats(estimated_beats)
# Compute the F-measure metric and store it in f_measure
f_measure = mir_eval.beat.f_measure(reference_beats, estimated_beats)

The documentation for each metric function, found in the mir_eval section below, contains further usage information.

Alternatively, you can use the evaluator scripts which allow you to run evaluation from the command line, without writing any code. These scripts are available here:

https://github.com/craffel/mir_evaluators

mir_eval

The structure of the mir_eval Python module is as follows: Each MIR task for which evaluation metrics are included in mir_eval is given its own submodule, and each metric is defined as a separate function in each submodule. Every metric function includes detailed documentation, example usage, input validation, and references to the original paper which defined the metric (see the subsections below). The task submodules also all contain a function evaluate(), which takes as input reference and estimated annotations and returns a dictionary of scores for all of the metrics implemented (for casual users, this is the place to start). Finally, each task submodule also includes functions for common data pre-processing steps.
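
For example, a minimal sketch of inspecting the dictionary returned by one of these evaluate() functions (file names here are placeholders):

import mir_eval

reference_beats = mir_eval.io.load_events('reference_beats.txt')
estimated_beats = mir_eval.io.load_events('estimated_beats.txt')
# evaluate() returns a dict mapping metric names to the scores achieved
scores = mir_eval.beat.evaluate(reference_beats, estimated_beats)
for metric_name, score in scores.items():
    print('{}: {:.3f}'.format(metric_name, score))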

mir_eval also includes the following additional submodules:

  • mir_eval.io which contains convenience functions for loading in task-specific data from common file formats

  • mir_eval.util which includes miscellaneous functionality shared across the submodules

  • mir_eval.sonify which implements some simple methods for synthesizing annotations of various formats for “evaluation by ear”.

  • mir_eval.display which provides functions for plotting annotations for various tasks.

The following subsections document each submodule.

mir_eval.alignment

Alignment models are given a sequence of events along with a piece of audio, and then return a sequence of timestamps, with one timestamp for each event, indicating the position of this event in the audio. The events are listed in order of occurrence in the audio, so that output timestamps have to be monotonically increasing. Evaluation usually involves taking the series of predicted and ground truth timestamps and comparing their distance, usually on a pair-wise basis, e.g. taking the median absolute error in seconds.

Conventions

Timestamps should be provided in the form of a 1-dimensional array of onset times in seconds in increasing order.
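
For example, a pair of valid timestamp arrays can be constructed and checked as follows (the values are illustrative):

import numpy as np
import mir_eval

# 1-dimensional arrays of onset times in seconds, in increasing order
reference_timestamps = np.array([0.50, 1.75, 3.20, 4.10])
estimated_timestamps = np.array([0.48, 1.80, 3.30, 4.25])
# Raises an informative error if the arrays are not valid timestamp arrays
mir_eval.alignment.validate(reference_timestamps, estimated_timestamps)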

Metrics

  • mir_eval.alignment.absolute_error(): Median absolute error and average absolute error between estimated and reference timestamps

  • mir_eval.alignment.percentage_correct(): Percentage of correct timestamps, where a timestamp is counted as correct if it lies within a certain tolerance window around the ground truth timestamp

  • mir_eval.alignment.percentage_correct_segments(): Percentage of correct segments: percentage of overlap between predicted segments and ground truth segments, where segments are defined by (start time, end time) pairs

  • mir_eval.alignment.karaoke_perceptual_metric(): Metric based on human synchronicity perception as measured in the paper “User-centered evaluation of lyrics to audio alignment”, N. Lizé-Masclef, A. Vaglio, M. Moussallam, ISMIR 2021

References

1

N. Lizé-Masclef, A. Vaglio, M. Moussallam. “User-centered evaluation of lyrics to audio alignment”, International Society for Music Information Retrieval (ISMIR) conference, 2021.

2

M. Mauch, H. Fujihara, M. Goto. “Lyrics-to-audio alignment and phrase-level segmentation using incomplete internet-style chord annotations”, Proceedings of the Sound and Music Computing Conference (SMC), 2010.

3

G. Dzhambazov. “Knowledge-Based Probabilistic Modeling For Tracking Lyrics In Music Audio Signals”, PhD Thesis, 2017.

4

H. Fujihara, M. Goto, J. Ogata, H. Okuno. “LyricSynchronizer: Automatic synchronization system between musical audio signals and lyrics”, IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, 2011.

mir_eval.alignment.validate(reference_timestamps: numpy.ndarray, estimated_timestamps: numpy.ndarray)

Checks that the input annotations to a metric look like valid onset time arrays, and throws helpful errors if not.

Parameters
reference_timestampsnp.ndarray

reference timestamp locations, in seconds

estimated_timestampsnp.ndarray

estimated timestamp locations, in seconds

mir_eval.alignment.absolute_error(reference_timestamps, estimated_timestamps)

Compute the absolute deviations between estimated and reference timestamps, and then return the median and average over all events.

Parameters
reference_timestampsnp.ndarray

reference timestamps, in seconds

estimated_timestampsnp.ndarray

estimated timestamps, in seconds

Returns
maefloat

Median absolute error

aae: float

Average absolute error

Examples

>>> reference_timestamps = mir_eval.io.load_events('reference.txt')
>>> estimated_timestamps = mir_eval.io.load_events('estimated.txt')
>>> mae, aae = mir_eval.alignment.absolute_error(reference_timestamps, estimated_timestamps)
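
For intuition, the two returned values are just the median and mean of the absolute deviations; a rough NumPy sketch of that arithmetic (illustrative values, not the library implementation):

import numpy as np

reference_timestamps = np.array([0.50, 1.75, 3.20])  # illustrative values
estimated_timestamps = np.array([0.48, 1.80, 3.35])
deviations = np.abs(estimated_timestamps - reference_timestamps)
mae = np.median(deviations)  # median absolute error
aae = np.mean(deviations)    # average absolute error
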
mir_eval.alignment.percentage_correct(reference_timestamps, estimated_timestamps, window=0.3)

Compute the percentage of correctly predicted timestamps. A timestamp is predicted correctly if its position does not deviate by more than the window parameter from the ground truth timestamp.

Parameters
reference_timestampsnp.ndarray

reference timestamps, in seconds

estimated_timestampsnp.ndarray

estimated timestamps, in seconds

windowfloat

Window size, in seconds (Default value = .3)

Returns
pcfloat

Percentage of correct timestamps

Examples

>>> reference_timestamps = mir_eval.io.load_events('reference.txt')
>>> estimated_timestamps = mir_eval.io.load_events('estimated.txt')
>>> pc = mir_eval.alignment.percentage_correct(reference_timestamps, estimated_timestamps, window=0.2)
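
Conceptually, this is the fraction of events whose absolute deviation lies within the window; a minimal sketch of that reading (illustrative values, not the library implementation):

import numpy as np

reference_timestamps = np.array([0.50, 1.75, 3.20, 4.10])  # illustrative values
estimated_timestamps = np.array([0.48, 1.80, 3.60, 4.15])
window = 0.3
within_window = np.abs(estimated_timestamps - reference_timestamps) <= window
pc = np.mean(within_window)  # fraction of correctly placed timestamps
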
mir_eval.alignment.percentage_correct_segments(reference_timestamps, estimated_timestamps, duration: Optional[float] = None)

Calculates the percentage of correct segments (PCS) metric.

It constructs segments from the reference and estimated timestamp vectors separately and calculates the percentage of overlap between corresponding segments relative to the total duration.

WARNING: This metric behaves differently depending on whether “duration” is given!

If duration is not given (default case), the computation follows the MIREX lyrics alignment challenge 2020. For a timestamp vector with entries (t1,t2, … tN), segments with the following (start, end) boundaries are created: (t1, t2), … (tN-1, tN). After the segments are created, the overlap between the reference and estimated segments is determined and divided by the total duration, which is the distance between the first and last timestamp in the reference.

If duration is given, the segment boundaries are instead (0, t1), (t1, t2), … (tN, duration). The overlap is computed in the same way, but then divided by the duration parameter given to this function. This method follows more closely the original paper in which the metric was proposed [#fujihara2011]. As a result, this variant of the metric penalizes cases where the first estimated timestamp is too early or the last estimated timestamp is too late, whereas the MIREX variant does not. On the other hand, the MIREX metric is invariant to how long the eventless beginning and end parts of the audio are, which might be a desirable property.
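
The difference between the two segmentations can be sketched directly from the timestamps (illustrative values, not the library implementation):

import numpy as np

timestamps = np.array([1.0, 2.5, 4.0])  # illustrative timestamps (t1, t2, t3)
duration = 6.0

# Default (MIREX-style) variant: segments between consecutive timestamps only
mirex_segments = list(zip(timestamps[:-1], timestamps[1:]))
# -> segments (1.0, 2.5) and (2.5, 4.0)

# Variant with duration given: segments also cover (0, t1) and (tN, duration)
boundaries = np.concatenate(([0.0], timestamps, [duration]))
full_segments = list(zip(boundaries[:-1], boundaries[1:]))
# -> segments (0.0, 1.0), (1.0, 2.5), (2.5, 4.0), and (4.0, 6.0)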

Parameters
reference_timestampsnp.ndarray

reference timestamps, in seconds

estimated_timestampsnp.ndarray

estimated timestamps, in seconds

durationfloat

Optional. Total duration of audio (seconds). WARNING: Metric is computed differently depending on whether this is provided or not - see documentation above!

Returns
pcsfloat

Percentage of time where ground truth and predicted segments overlap

Examples

>>> reference_timestamps = mir_eval.io.load_events('reference.txt')
>>> estimated_timestamps = mir_eval.io.load_events('estimated.txt')
>>> pcs = mir_eval.alignment.percentage_correct_segments(reference_timestamps, estimated_timestamps)
mir_eval.alignment.karaoke_perceptual_metric(reference_timestamps, estimated_timestamps)

Metric based on human synchronicity perception as measured in the paper “User-centered evaluation of lyrics to audio alignment” [#lizemasclef2021]

The parameters of this function were tuned on data collected through a user Karaoke-like experiment. It reflects human judgment of how “synchronous” lyrics and audio stimuli are perceived in that setup. Beware that this metric is non-symmetrical, and by construction it is also not equal to 1 for a deviation of 0.

Parameters
reference_timestampsnp.ndarray

reference timestamps, in seconds

estimated_timestampsnp.ndarray

estimated timestamps, in seconds

Returns
perceptual_scorefloat

Perceptual score, averaged over all timestamps

Examples

>>> reference_timestamps = mir_eval.io.load_events('reference.txt')
>>> estimated_timestamps = mir_eval.io.load_events('estimated.txt')
>>> score = mir_eval.alignment.karaoke_perceptual_metric(reference_timestamps, estimated_timestamps)
mir_eval.alignment.evaluate(reference_timestamps, estimated_timestamps, **kwargs)

Compute all metrics for the given reference and estimated annotations.

Examples

>>> reference_timestamps = mir_eval.io.load_events('reference.txt')
>>> estimated_timestamps = mir_eval.io.load_events('estimated.txt')
>>> duration = max(np.max(reference_timestamps), np.max(estimated_timestamps)) + 10
>>> scores = mir_eval.alignment.evaluate(reference_timestamps, estimated_timestamps,
...                                      duration=duration)

Parameters
reference_timestampsnp.ndarray

reference timestamp locations, in seconds

estimated_timestampsnp.ndarray

estimated timestamp locations, in seconds

kwargs

Additional keyword arguments which will be passed to the appropriate metric or preprocessing functions.

Returns
scoresdict

Dictionary of scores, where the key is the metric name (str) and the value is the (float) score achieved.

mir_eval.beat

The aim of a beat detection algorithm is to report the times at which a typical human listener might tap their foot to a piece of music. As a result, most metrics for evaluating the performance of beat tracking systems involve computing the error between the estimated beat times and some reference list of beat locations. Many metrics additionally compare the beat sequences at different metric levels in order to deal with the ambiguity of tempo.

Based on the methods described in:

Matthew E. P. Davies, Norberto Degara, and Mark D. Plumbley. “Evaluation Methods for Musical Audio Beat Tracking Algorithms”, Queen Mary University of London Technical Report C4DM-TR-09-06 London, United Kingdom, 8 October 2009.

See also the Beat Evaluation Toolbox:

https://code.soundsoftware.ac.uk/projects/beat-evaluation/

Conventions

Beat times should be provided in the form of a 1-dimensional array of beat times in seconds in increasing order. Typically, any beats which occur before 5s are ignored; this can be accomplished using mir_eval.beat.trim_beats().

Metrics

  • mir_eval.beat.f_measure(): The F-measure of the beat sequence, where an estimated beat is considered correct if it is sufficiently close to a reference beat

  • mir_eval.beat.cemgil(): Cemgil’s score, which computes the sum of Gaussian errors for each beat

  • mir_eval.beat.goto(): Goto’s score, a binary score which is 1 when at least 25% of the estimated beat sequence closely matches the reference beat sequence

  • mir_eval.beat.p_score(): McKinney’s P-score, which computes the cross-correlation of the estimated and reference beat sequences represented as impulse trains

  • mir_eval.beat.continuity(): Continuity-based scores which compute the proportion of the beat sequence which is continuously correct

  • mir_eval.beat.information_gain(): The Information Gain of a normalized beat error histogram over a uniform distribution

mir_eval.beat.trim_beats(beats, min_beat_time=5.0)

Removes beats before min_beat_time. A common preprocessing step.

Parameters
beatsnp.ndarray

Array of beat times in seconds.

min_beat_timefloat

Minimum beat time to allow (Default value = 5.)

Returns
beats_trimmednp.ndarray

Trimmed beat array.

mir_eval.beat.validate(reference_beats, estimated_beats)

Checks that the input annotations to a metric look like valid beat time arrays, and throws helpful errors if not.

Parameters
reference_beatsnp.ndarray

reference beat times, in seconds

estimated_beatsnp.ndarray

estimated beat times, in seconds

mir_eval.beat.f_measure(reference_beats, estimated_beats, f_measure_threshold=0.07)

Compute the F-measure of correct vs incorrectly predicted beats. “Correctness” is determined over a small window.

Parameters
reference_beatsnp.ndarray

reference beat times, in seconds

estimated_beatsnp.ndarray

estimated beat times, in seconds

f_measure_thresholdfloat

Window size, in seconds (Default value = 0.07)

Returns
f_scorefloat

The computed F-measure score

Examples

>>> reference_beats = mir_eval.io.load_events('reference.txt')
>>> reference_beats = mir_eval.beat.trim_beats(reference_beats)
>>> estimated_beats = mir_eval.io.load_events('estimated.txt')
>>> estimated_beats = mir_eval.beat.trim_beats(estimated_beats)
>>> f_measure = mir_eval.beat.f_measure(reference_beats,
                                        estimated_beats)
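
The score combines precision and recall of window-matched beats. A rough sketch of that arithmetic, assuming mir_eval.util.match_events() (which pairs up reference and estimated events lying within the window) is used to find the hits; an illustration, not necessarily the library's exact implementation:

import mir_eval

reference_beats = mir_eval.io.load_events('reference.txt')
estimated_beats = mir_eval.io.load_events('estimated.txt')
window = 0.07

# One-to-one pairs of (reference index, estimated index) within the window
matching = mir_eval.util.match_events(reference_beats, estimated_beats, window)
precision = len(matching) / len(estimated_beats)
recall = len(matching) / len(reference_beats)
if precision + recall > 0:
    f_measure = 2 * precision * recall / (precision + recall)
else:
    f_measure = 0.0
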
mir_eval.beat.cemgil(reference_beats, estimated_beats, cemgil_sigma=0.04)

Cemgil’s score computes a Gaussian error for each estimated beat. It compares against the original beat times and all metrical variations.

Parameters
reference_beatsnp.ndarray

reference beat times, in seconds

estimated_beatsnp.ndarray

query beat times, in seconds

cemgil_sigmafloat

Sigma parameter of gaussian error windows (Default value = 0.04)

Returns
cemgil_scorefloat

Cemgil’s score for the original reference beats

cemgil_maxfloat

The best Cemgil score for all metrical variations

Examples

>>> reference_beats = mir_eval.io.load_events('reference.txt')
>>> reference_beats = mir_eval.beat.trim_beats(reference_beats)
>>> estimated_beats = mir_eval.io.load_events('estimated.txt')
>>> estimated_beats = mir_eval.beat.trim_beats(estimated_beats)
>>> cemgil_score, cemgil_max = mir_eval.beat.cemgil(reference_beats,
                                                    estimated_beats)
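
For intuition, one common formulation of Cemgil's accuracy places a Gaussian around each reference beat, accumulates its value at the nearest estimated beat, and normalizes by the mean number of beats; a sketch of that idea (illustrative, not necessarily the library's exact computation):

import numpy as np

reference_beats = np.array([5.5, 6.0, 6.5, 7.0])    # illustrative values
estimated_beats = np.array([5.52, 6.03, 6.51, 7.1])
sigma = 0.04

# Distance from each reference beat to its closest estimated beat
errors = np.array([np.min(np.abs(estimated_beats - b)) for b in reference_beats])
accuracy = np.sum(np.exp(-errors ** 2 / (2 * sigma ** 2)))
accuracy /= 0.5 * (len(reference_beats) + len(estimated_beats))
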
mir_eval.beat.goto(reference_beats, estimated_beats, goto_threshold=0.35, goto_mu=0.2, goto_sigma=0.2)

Calculate Goto’s score, a binary 1 or 0 depending on some specific heuristic criteria

Parameters
reference_beatsnp.ndarray

reference beat times, in seconds

estimated_beatsnp.ndarray

query beat times, in seconds

goto_thresholdfloat

Threshold of beat error for a beat to be “correct” (Default value = 0.35)

goto_mufloat

The mean of the beat errors in the continuously correct track must be less than this (Default value = 0.2)

goto_sigmafloat

The std of the beat errors in the continuously correct track must be less than this (Default value = 0.2)

Returns
goto_scorefloat

Either 1.0 or 0.0, depending on whether some specific criteria are met

Examples

>>> reference_beats = mir_eval.io.load_events('reference.txt')
>>> reference_beats = mir_eval.beat.trim_beats(reference_beats)
>>> estimated_beats = mir_eval.io.load_events('estimated.txt')
>>> estimated_beats = mir_eval.beat.trim_beats(estimated_beats)
>>> goto_score = mir_eval.beat.goto(reference_beats, estimated_beats)
mir_eval.beat.p_score(reference_beats, estimated_beats, p_score_threshold=0.2)

Get McKinney’s P-score, based on the cross-correlation of the reference and estimated beats.

Parameters
reference_beatsnp.ndarray

reference beat times, in seconds

estimated_beatsnp.ndarray

query beat times, in seconds

p_score_thresholdfloat

Window size will be p_score_threshold * np.median(inter_annotation_intervals) (Default value = 0.2)

Returns
correlationfloat

McKinney’s P-score

Examples

>>> reference_beats = mir_eval.io.load_events('reference.txt')
>>> reference_beats = mir_eval.beat.trim_beats(reference_beats)
>>> estimated_beats = mir_eval.io.load_events('estimated.txt')
>>> estimated_beats = mir_eval.beat.trim_beats(estimated_beats)
>>> p_score = mir_eval.beat.p_score(reference_beats, estimated_beats)
mir_eval.beat.continuity(reference_beats, estimated_beats, continuity_phase_threshold=0.175, continuity_period_threshold=0.175)

Get metrics based on how much of the estimated beat sequence is continually correct.

Parameters
reference_beatsnp.ndarray

reference beat times, in seconds

estimated_beatsnp.ndarray

query beat times, in seconds

continuity_phase_thresholdfloat

Allowable ratio of how far the estimated beat can be from the reference beat (Default value = 0.175)

continuity_period_thresholdfloat

Allowable distance between the inter-beat-interval and the inter-annotation-interval (Default value = 0.175)

Returns
CMLcfloat

Correct metric level, continuous accuracy

CMLtfloat

Correct metric level, total accuracy (continuity not required)

AMLcfloat

Any metric level, continuous accuracy

AMLtfloat

Any metric level, total accuracy (continuity not required)

Examples

>>> reference_beats = mir_eval.io.load_events('reference.txt')
>>> reference_beats = mir_eval.beat.trim_beats(reference_beats)
>>> estimated_beats = mir_eval.io.load_events('estimated.txt')
>>> estimated_beats = mir_eval.beat.trim_beats(estimated_beats)
>>> CMLc, CMLt, AMLc, AMLt = mir_eval.beat.continuity(reference_beats,
                                                      estimated_beats)
mir_eval.beat.information_gain(reference_beats, estimated_beats, bins=41)

Get the information gain: the K-L divergence of the beat error histogram compared to a uniform histogram.

Parameters
reference_beatsnp.ndarray

reference beat times, in seconds

estimated_beatsnp.ndarray

query beat times, in seconds

binsint

Number of bins in the beat error histogram (Default value = 41)

Returns
information_gain_scorefloat

Entropy of beat error histogram

Examples

>>> reference_beats = mir_eval.io.load_events('reference.txt')
>>> reference_beats = mir_eval.beat.trim_beats(reference_beats)
>>> estimated_beats = mir_eval.io.load_events('estimated.txt')
>>> estimated_beats = mir_eval.beat.trim_beats(estimated_beats)
>>> information_gain = mir_eval.beat.information_gain(reference_beats,
                                                      estimated_beats)
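
For intuition, the K-L divergence of a normalized histogram from a uniform distribution over the same bins equals log2(bins) minus the histogram's entropy; a small sketch of that identity (the library's construction of the beat error histogram itself is not shown here):

import numpy as np

bins = 41
hist = np.random.dirichlet(np.ones(bins))  # stand-in normalized beat error histogram
nonzero = hist[hist > 0]
entropy = -np.sum(nonzero * np.log2(nonzero))
information_gain = np.log2(bins) - entropy  # K-L divergence from the uniform distribution
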
mir_eval.beat.evaluate(reference_beats, estimated_beats, **kwargs)

Compute all metrics for the given reference and estimated annotations.

Parameters
reference_beatsnp.ndarray

Reference beat times, in seconds

estimated_beatsnp.ndarray

Query beat times, in seconds

kwargs

Additional keyword arguments which will be passed to the appropriate metric or preprocessing functions.

Returns
scoresdict

Dictionary of scores, where the key is the metric name (str) and the value is the (float) score achieved.

Examples

>>> reference_beats = mir_eval.io.load_events('reference.txt')
>>> estimated_beats = mir_eval.io.load_events('estimated.txt')
>>> scores = mir_eval.beat.evaluate(reference_beats, estimated_beats)

mir_eval.chord

Chord estimation algorithms produce a list of intervals and labels which denote the chord being played over each timespan. They are evaluated by comparing the estimated chord labels to some reference, usually using a mapping to a chord subalphabet (e.g. minor and major chords only, all triads, etc.). There is no single ‘right’ way to compare two sequences of chord labels. Embracing this reality, every conventional comparison rule is provided. Comparisons are made over the different components of each chord (e.g. G:maj(6)/5): the root (G), the root-invariant active semitones as determined by the quality shorthand (maj) and scale degrees (6), and the bass interval (5). This submodule provides functions both for comparing sequences of chord labels according to some chord subalphabet mapping and for using these comparisons to score a sequence of estimated chords against a reference.

Conventions

A sequence of chord labels is represented as a list of strings, where each label is the chord name based on the syntax of 5. Reference and estimated chord label sequences should be of the same length for comparison functions. When converting the chord string into its constituent parts,

  • Pitch class counting starts at C, e.g. C:0, D:2, E:4, F:5, etc.

  • Scale degree is represented as a string of the diatonic interval, relative to the root note, e.g. ‘b6’, ‘#5’, or ‘7’

  • Bass intervals are represented as strings

  • Chord bitmaps are positional binary vectors indicating active pitch classes and may be absolute or relative depending on context in the code.

If no chord is present at a given point in time, it should have the label ‘N’, which is defined in the variable mir_eval.chord.NO_CHORD.
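
For example, a short label sequence with a “no chord” frame looks like this:

import mir_eval

# Chord labels are plain strings; 'N' marks frames where no chord sounds
labels = ['N', 'C:maj', 'C:maj', 'G:7', 'A:min/b3']
assert labels[0] == mir_eval.chord.NO_CHORD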

Metrics

  • mir_eval.chord.root(): Only compares the root of the chords.

  • mir_eval.chord.majmin(): Only compares major, minor, and “no chord” labels.

  • mir_eval.chord.majmin_inv(): Compares major/minor chords, with inversions. The bass note must exist in the triad.

  • mir_eval.chord.mirex(): An estimated chord is considered correct if it shares at least three pitch classes in common with the reference chord.

  • mir_eval.chord.thirds(): Chords are compared at the level of major or minor thirds (root and third). For example, both (‘A:7’, ‘A:maj’) and (‘A:min’, ‘A:dim’) are equivalent, as the third is major and minor in quality, respectively.

  • mir_eval.chord.thirds_inv(): Same as above, with inversions (bass relationships).

  • mir_eval.chord.triads(): Chords are considered at the level of triads (major, minor, augmented, diminished, suspended), meaning that, in addition to the root, the quality is only considered through #5th scale degree (for augmented chords). For example, (‘A:7’, ‘A:maj’) are equivalent, while (‘A:min’, ‘A:dim’) and (‘A:aug’, ‘A:maj’) are not.

  • mir_eval.chord.triads_inv(): Same as above, with inversions (bass relationships).

  • mir_eval.chord.tetrads(): Chords are considered at the level of the entire quality in closed voicing, i.e. spanning only a single octave; extended chords (9’s, 11’s and 13’s) are rolled into a single octave with any upper voices included as extensions. For example, (‘A:7’, ‘A:9’) are equivalent but (‘A:7’, ‘A:maj7’) are not.

  • mir_eval.chord.tetrads_inv(): Same as above, with inversions (bass relationships).

  • mir_eval.chord.sevenths(): Compares according to MIREX “sevenths” rules; that is, only major, major seventh, seventh, minor, minor seventh and no chord labels are compared.

  • mir_eval.chord.sevenths_inv(): Same as above, with inversions (bass relationships).

  • mir_eval.chord.overseg(): Computes the level of over-segmentation between estimated and reference intervals.

  • mir_eval.chord.underseg(): Computes the level of under-segmentation between estimated and reference intervals.

  • mir_eval.chord.seg(): Computes the minimum of over- and under-segmentation between estimated and reference intervals.

References

5

C. Harte. Towards Automatic Extraction of Harmony Information from Music Signals. PhD thesis, Queen Mary University of London, August 2010.

exception mir_eval.chord.InvalidChordException(message='', chord_label=None)

Bases: Exception

Exception class for suspect / invalid chord labels

mir_eval.chord.pitch_class_to_semitone(pitch_class)

Convert a pitch class to semitone.

Parameters
pitch_classstr

Spelling of a given pitch class, e.g. ‘C#’, ‘Gbb’

Returns
semitoneint

Semitone value of the pitch class.

mir_eval.chord.scale_degree_to_semitone(scale_degree)

Convert a scale degree to semitone.

Parameters
scale degreestr

Spelling of a relative scale degree, e.g. ‘b3’, ‘7’, ‘#5’

Returns
semitoneint

Relative semitone of the scale degree, wrapped to a single octave

Raises
InvalidChordException if scale_degree is invalid.
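
For instance, under standard interval spelling a major third ‘3’ is 4 semitones, ‘b3’ is 3, and ‘#5’ is 8 (a small illustrative check):

import mir_eval

assert mir_eval.chord.scale_degree_to_semitone('3') == 4
assert mir_eval.chord.scale_degree_to_semitone('b3') == 3
assert mir_eval.chord.scale_degree_to_semitone('#5') == 8
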
mir_eval.chord.scale_degree_to_bitmap(scale_degree, modulo=False, length=12)

Create a bitmap representation of a scale degree.

Note that values in the bitmap may be negative, indicating that the semitone is to be removed.

Parameters
scale_degreestr

Spelling of a relative scale degree, e.g. ‘b3’, ‘7’, ‘#5’

modulobool, default=False

If a scale degree exceeds the length of the bit-vector, modulo the scale degree back into the bit-vector; otherwise it is discarded.

lengthint, default=12

Length of the bit-vector to produce

Returns
bitmapnp.ndarray, in [-1, 0, 1], len=`length`

Bitmap representation of this scale degree.

mir_eval.chord.quality_to_bitmap(quality)

Return the bitmap for a given quality.

Parameters
qualitystr

Chord quality name.

Returns
bitmapnp.ndarray

Bitmap representation of this quality (12-dim).

mir_eval.chord.reduce_extended_quality(quality)

Map an extended chord quality to a simpler one, moving upper voices to a set of scale degree extensions.

Parameters
qualitystr

Extended chord quality to reduce.

Returns
base_qualitystr

New chord quality.

extensionsset

Scale degrees extensions for the quality.

mir_eval.chord.validate_chord_label(chord_label)

Test for well-formedness of a chord label.

Parameters
chordstr

Chord label to validate.

mir_eval.chord.split(chord_label, reduce_extended_chords=False)
Parse a chord label into its four constituent parts:
  • root

  • quality shorthand

  • scale degrees

  • bass

Note: Chords lacking quality AND interval information are major.
  • If a quality is specified, it is returned.

  • If an interval is specified WITHOUT a quality, the quality field is empty.

Some examples:

'C' -> ['C', 'maj', {}, '1']
'G#:min(*b3,*5)/5' -> ['G#', 'min', {'*b3', '*5'}, '5']
'A:(3)/6' -> ['A', '', {'3'}, '6']
Parameters
chord_labelstr

A chord label.

reduce_extended_chordsbool

Whether to map the upper voicings of extended chords (9’s, 11’s, 13’s) to semitone extensions. (Default value = False)

Returns
chord_partslist

Split version of the chord label.

mir_eval.chord.join(chord_root, quality='', extensions=None, bass='')

Join the parts of a chord into a complete chord label.

Parameters
chord_rootstr

Root pitch class of the chord, e.g. ‘C’, ‘Eb’

qualitystr

Quality of the chord, e.g. ‘maj’, ‘hdim7’ (Default value = ‘’)

extensionslist

Any added or absent scale degrees for this chord, e.g. [‘4’, ‘*3’] (Default value = None)

bassstr

Scale degree of the bass note, e.g. ‘5’. (Default value = ‘’)

Returns
chord_labelstr

A complete chord label.
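
For example, joining the parts shown in the split() example above should reproduce the original label (a small illustrative check):

import mir_eval

label = mir_eval.chord.join('G#', quality='min', extensions=['*b3', '*5'], bass='5')
# Expected to reproduce a label equivalent to 'G#:min(*b3,*5)/5'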

mir_eval.chord.encode(chord_label, reduce_extended_chords=False, strict_bass_intervals=False)

Translate a chord label to numerical representations for evaluation.

Parameters
chord_labelstr

Chord label to encode.

reduce_extended_chordsbool

Whether to map the upper voicings of extended chords (9’s, 11’s, 13’s) to semitone extensions. (Default value = False)

strict_bass_intervalsbool

Whether to require that the bass scale degree is present in the chord. (Default value = False)

Returns
root_numberint

Absolute semitone of the chord’s root.

semitone_bitmapnp.ndarray, dtype=int

12-dim vector of relative semitones in the chord spelling.

bass_numberint

Relative semitone of the chord’s bass note, e.g. 0=root, 7=fifth, etc.
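
As a quick illustration of the returned values, encoding a plain G major label should yield the root's absolute semitone, a relative major-triad bitmap, and a bass interval of 0 (the root); a hedged example:

import mir_eval

root_number, semitone_bitmap, bass_number = mir_eval.chord.encode('G:maj')
# root_number: 7 (pitch class of G), bass_number: 0 (root position)
# semitone_bitmap has 1s at the root-relative positions 0, 4, and 7 (a major triad)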

mir_eval.chord.encode_many(chord_labels, reduce_extended_chords=False)

Translate a set of chord labels to numerical representations for sane evaluation.

Parameters
chord_labelslist

Set of chord labels to encode.

reduce_extended_chordsbool

Whether to map the upper voicings of extended chords (9’s, 11’s, 13’s) to semitone extensions. (Default value = False)

Returns
root_numbernp.ndarray, dtype=int

Absolute semitone of the chord’s root.

interval_bitmapnp.ndarray, dtype=int

12-dim vector of relative semitones in the given chord quality.

bass_numbernp.ndarray, dtype=int

Relative semitones of the chord’s bass notes.

mir_eval.chord.rotate_bitmap_to_root(bitmap, chord_root)

Circularly shift a relative bitmap to its absolute pitch classes.

For clarity, the best explanation is an example. Given ‘G:Maj’, the root and quality map are as follows:

root=7
quality=[1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]  # Relative chord shape

After rotating to the root, the resulting bitmap becomes:

abs_quality = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1]  # G, B, and D
Parameters
bitmapnp.ndarray, shape=(12,)

Bitmap of active notes, relative to the given root.

chord_rootint

Absolute pitch class number.

Returns
bitmapnp.ndarray, shape=(12,)

Absolute bitmap of active pitch classes.

mir_eval.chord.rotate_bitmaps_to_roots(bitmaps, roots)

Circularly shift relative bitmaps to absolute pitch classes.

See rotate_bitmap_to_root() for more information.

Parameters
bitmapnp.ndarray, shape=(N, 12)

Bitmap of active notes, relative to the given root.

rootnp.ndarray, shape=(N,)

Absolute pitch class number.

Returns
bitmapnp.ndarray, shape=(N, 12)

Absolute bitmaps of active pitch classes.

mir_eval.chord.validate(reference_labels, estimated_labels)

Checks that the input annotations to a comparison function look like valid chord labels.

Parameters
reference_labelslist, len=n

Reference chord labels to score against.

estimated_labelslist, len=n

Estimated chord labels to score against.

mir_eval.chord.weighted_accuracy(comparisons, weights)

Compute the weighted accuracy of a list of chord comparisons.

Parameters
comparisonsnp.ndarray

List of chord comparison scores, in [0, 1] or -1

weightsnp.ndarray

Weights (not necessarily normalized) for each comparison. This can be a list of interval durations

Returns
scorefloat

Weighted accuracy

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> est_intervals, est_labels = mir_eval.util.adjust_intervals(
...     est_intervals, est_labels, ref_intervals.min(),
...     ref_intervals.max(), mir_eval.chord.NO_CHORD,
...     mir_eval.chord.NO_CHORD)
>>> (intervals,
...  ref_labels,
...  est_labels) = mir_eval.util.merge_labeled_intervals(
...      ref_intervals, ref_labels, est_intervals, est_labels)
>>> durations = mir_eval.util.intervals_to_durations(intervals)
>>> # Here, we're using the "thirds" function to compare labels
>>> # but any of the comparison functions would work.
>>> comparisons = mir_eval.chord.thirds(ref_labels, est_labels)
>>> score = mir_eval.chord.weighted_accuracy(comparisons, durations)
mir_eval.chord.thirds(reference_labels, estimated_labels)

Compare chords along root & third relationships.

Parameters
reference_labelslist, len=n

Reference chord labels to score against.

estimated_labelslist, len=n

Estimated chord labels to score against.

Returns
comparison_scoresnp.ndarray, shape=(n,), dtype=float

Comparison scores, in [0.0, 1.0]

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> est_intervals, est_labels = mir_eval.util.adjust_intervals(
...     est_intervals, est_labels, ref_intervals.min(),
...     ref_intervals.max(), mir_eval.chord.NO_CHORD,
...     mir_eval.chord.NO_CHORD)
>>> (intervals,
...  ref_labels,
...  est_labels) = mir_eval.util.merge_labeled_intervals(
...      ref_intervals, ref_labels, est_intervals, est_labels)
>>> durations = mir_eval.util.intervals_to_durations(intervals)
>>> comparisons = mir_eval.chord.thirds(ref_labels, est_labels)
>>> score = mir_eval.chord.weighted_accuracy(comparisons, durations)
mir_eval.chord.thirds_inv(reference_labels, estimated_labels)

Score chords along root, third, & bass relationships.

Parameters
reference_labelslist, len=n

Reference chord labels to score against.

estimated_labelslist, len=n

Estimated chord labels to score against.

Returns
scoresnp.ndarray, shape=(n,), dtype=float

Comparison scores, in [0.0, 1.0]

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> est_intervals, est_labels = mir_eval.util.adjust_intervals(
...     est_intervals, est_labels, ref_intervals.min(),
...     ref_intervals.max(), mir_eval.chord.NO_CHORD,
...     mir_eval.chord.NO_CHORD)
>>> (intervals,
...  ref_labels,
...  est_labels) = mir_eval.util.merge_labeled_intervals(
...      ref_intervals, ref_labels, est_intervals, est_labels)
>>> durations = mir_eval.util.intervals_to_durations(intervals)
>>> comparisons = mir_eval.chord.thirds_inv(ref_labels, est_labels)
>>> score = mir_eval.chord.weighted_accuracy(comparisons, durations)
mir_eval.chord.triads(reference_labels, estimated_labels)

Compare chords along triad (root & quality to #5) relationships.

Parameters
reference_labelslist, len=n

Reference chord labels to score against.

estimated_labelslist, len=n

Estimated chord labels to score against.

Returns
comparison_scoresnp.ndarray, shape=(n,), dtype=float

Comparison scores, in [0.0, 1.0]

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> est_intervals, est_labels = mir_eval.util.adjust_intervals(
...     est_intervals, est_labels, ref_intervals.min(),
...     ref_intervals.max(), mir_eval.chord.NO_CHORD,
...     mir_eval.chord.NO_CHORD)
>>> (intervals,
...  ref_labels,
...  est_labels) = mir_eval.util.merge_labeled_intervals(
...      ref_intervals, ref_labels, est_intervals, est_labels)
>>> durations = mir_eval.util.intervals_to_durations(intervals)
>>> comparisons = mir_eval.chord.triads(ref_labels, est_labels)
>>> score = mir_eval.chord.weighted_accuracy(comparisons, durations)
mir_eval.chord.triads_inv(reference_labels, estimated_labels)

Score chords along triad (root, quality to #5, & bass) relationships.

Parameters
reference_labelslist, len=n

Reference chord labels to score against.

estimated_labelslist, len=n

Estimated chord labels to score against.

Returns
scoresnp.ndarray, shape=(n,), dtype=float

Comparison scores, in [0.0, 1.0]

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> est_intervals, est_labels = mir_eval.util.adjust_intervals(
...     est_intervals, est_labels, ref_intervals.min(),
...     ref_intervals.max(), mir_eval.chord.NO_CHORD,
...     mir_eval.chord.NO_CHORD)
>>> (intervals,
...  ref_labels,
...  est_labels) = mir_eval.util.merge_labeled_intervals(
...      ref_intervals, ref_labels, est_intervals, est_labels)
>>> durations = mir_eval.util.intervals_to_durations(intervals)
>>> comparisons = mir_eval.chord.triads_inv(ref_labels, est_labels)
>>> score = mir_eval.chord.weighted_accuracy(comparisons, durations)
mir_eval.chord.tetrads(reference_labels, estimated_labels)

Compare chords along tetrad (root & full quality) relationships.

Parameters
reference_labelslist, len=n

Reference chord labels to score against.

estimated_labelslist, len=n

Estimated chord labels to score against.

Returns
comparison_scoresnp.ndarray, shape=(n,), dtype=float

Comparison scores, in [0.0, 1.0]

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> est_intervals, est_labels = mir_eval.util.adjust_intervals(
...     est_intervals, est_labels, ref_intervals.min(),
...     ref_intervals.max(), mir_eval.chord.NO_CHORD,
...     mir_eval.chord.NO_CHORD)
>>> (intervals,
...  ref_labels,
...  est_labels) = mir_eval.util.merge_labeled_intervals(
...      ref_intervals, ref_labels, est_intervals, est_labels)
>>> durations = mir_eval.util.intervals_to_durations(intervals)
>>> comparisons = mir_eval.chord.tetrads(ref_labels, est_labels)
>>> score = mir_eval.chord.weighted_accuracy(comparisons, durations)
mir_eval.chord.tetrads_inv(reference_labels, estimated_labels)

Compare chords along tetrad (root, full quality, & bass) relationships.

Parameters
reference_labelslist, len=n

Reference chord labels to score against.

estimated_labelslist, len=n

Estimated chord labels to score against.

Returns
comparison_scoresnp.ndarray, shape=(n,), dtype=float

Comparison scores, in [0.0, 1.0]

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> est_intervals, est_labels = mir_eval.util.adjust_intervals(
...     est_intervals, est_labels, ref_intervals.min(),
...     ref_intervals.max(), mir_eval.chord.NO_CHORD,
...     mir_eval.chord.NO_CHORD)
>>> (intervals,
...  ref_labels,
...  est_labels) = mir_eval.util.merge_labeled_intervals(
...      ref_intervals, ref_labels, est_intervals, est_labels)
>>> durations = mir_eval.util.intervals_to_durations(intervals)
>>> comparisons = mir_eval.chord.tetrads_inv(ref_labels, est_labels)
>>> score = mir_eval.chord.weighted_accuracy(comparisons, durations)
mir_eval.chord.root(reference_labels, estimated_labels)

Compare chords according to roots.

Parameters
reference_labelslist, len=n

Reference chord labels to score against.

estimated_labelslist, len=n

Estimated chord labels to score against.

Returns
comparison_scoresnp.ndarray, shape=(n,), dtype=float

Comparison scores, in [0.0, 1.0], or -1 if the comparison is out of gamut.

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> est_intervals, est_labels = mir_eval.util.adjust_intervals(
...     est_intervals, est_labels, ref_intervals.min(),
...     ref_intervals.max(), mir_eval.chord.NO_CHORD,
...     mir_eval.chord.NO_CHORD)
>>> (intervals,
...  ref_labels,
...  est_labels) = mir_eval.util.merge_labeled_intervals(
...      ref_intervals, ref_labels, est_intervals, est_labels)
>>> durations = mir_eval.util.intervals_to_durations(intervals)
>>> comparisons = mir_eval.chord.root(ref_labels, est_labels)
>>> score = mir_eval.chord.weighted_accuracy(comparisons, durations)
mir_eval.chord.mirex(reference_labels, estimated_labels)

Compare chords along MIREX rules.

Parameters
reference_labelslist, len=n

Reference chord labels to score against.

estimated_labelslist, len=n

Estimated chord labels to score against.

Returns
comparison_scoresnp.ndarray, shape=(n,), dtype=float

Comparison scores, in [0.0, 1.0]

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> est_intervals, est_labels = mir_eval.util.adjust_intervals(
...     est_intervals, est_labels, ref_intervals.min(),
...     ref_intervals.max(), mir_eval.chord.NO_CHORD,
...     mir_eval.chord.NO_CHORD)
>>> (intervals,
...  ref_labels,
...  est_labels) = mir_eval.util.merge_labeled_intervals(
...      ref_intervals, ref_labels, est_intervals, est_labels)
>>> durations = mir_eval.util.intervals_to_durations(intervals)
>>> comparisons = mir_eval.chord.mirex(ref_labels, est_labels)
>>> score = mir_eval.chord.weighted_accuracy(comparisons, durations)
mir_eval.chord.majmin(reference_labels, estimated_labels)

Compare chords along major-minor rules. Chords with qualities outside Major/minor/no-chord are ignored.

Parameters
reference_labelslist, len=n

Reference chord labels to score against.

estimated_labelslist, len=n

Estimated chord labels to score against.

Returns
comparison_scoresnp.ndarray, shape=(n,), dtype=float

Comparison scores, in [0.0, 1.0], or -1 if the comparison is out of gamut.

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> est_intervals, est_labels = mir_eval.util.adjust_intervals(
...     est_intervals, est_labels, ref_intervals.min(),
...     ref_intervals.max(), mir_eval.chord.NO_CHORD,
...     mir_eval.chord.NO_CHORD)
>>> (intervals,
...  ref_labels,
...  est_labels) = mir_eval.util.merge_labeled_intervals(
...      ref_intervals, ref_labels, est_intervals, est_labels)
>>> durations = mir_eval.util.intervals_to_durations(intervals)
>>> comparisons = mir_eval.chord.majmin(ref_labels, est_labels)
>>> score = mir_eval.chord.weighted_accuracy(comparisons, durations)
mir_eval.chord.majmin_inv(reference_labels, estimated_labels)

Compare chords along major-minor rules, with inversions. Chords with qualities outside Major/minor/no-chord are ignored, and the bass note must exist in the triad (bass in [1, 3, 5]).

Parameters
reference_labelslist, len=n

Reference chord labels to score against.

estimated_labelslist, len=n

Estimated chord labels to score against.

Returns
comparison_scoresnp.ndarray, shape=(n,), dtype=float

Comparison scores, in [0.0, 1.0], or -1 if the comparison is out of gamut.

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> est_intervals, est_labels = mir_eval.util.adjust_intervals(
...     est_intervals, est_labels, ref_intervals.min(),
...     ref_intervals.max(), mir_eval.chord.NO_CHORD,
...     mir_eval.chord.NO_CHORD)
>>> (intervals,
...  ref_labels,
...  est_labels) = mir_eval.util.merge_labeled_intervals(
...      ref_intervals, ref_labels, est_intervals, est_labels)
>>> durations = mir_eval.util.intervals_to_durations(intervals)
>>> comparisons = mir_eval.chord.majmin_inv(ref_labels, est_labels)
>>> score = mir_eval.chord.weighted_accuracy(comparisons, durations)
mir_eval.chord.sevenths(reference_labels, estimated_labels)

Compare chords along MIREX ‘sevenths’ rules. Chords with qualities outside [maj, maj7, 7, min, min7, N] are ignored.

Parameters
reference_labelslist, len=n

Reference chord labels to score against.

estimated_labelslist, len=n

Estimated chord labels to score against.

Returns
comparison_scoresnp.ndarray, shape=(n,), dtype=float

Comparison scores, in [0.0, 1.0], or -1 if the comparison is out of gamut.

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> est_intervals, est_labels = mir_eval.util.adjust_intervals(
...     est_intervals, est_labels, ref_intervals.min(),
...     ref_intervals.max(), mir_eval.chord.NO_CHORD,
...     mir_eval.chord.NO_CHORD)
>>> (intervals,
...  ref_labels,
...  est_labels) = mir_eval.util.merge_labeled_intervals(
...      ref_intervals, ref_labels, est_intervals, est_labels)
>>> durations = mir_eval.util.intervals_to_durations(intervals)
>>> comparisons = mir_eval.chord.sevenths(ref_labels, est_labels)
>>> score = mir_eval.chord.weighted_accuracy(comparisons, durations)
mir_eval.chord.sevenths_inv(reference_labels, estimated_labels)

Compare chords along MIREX ‘sevenths’ rules, with inversions (bass relationships). Chords with qualities outside [maj, maj7, 7, min, min7, N] are ignored.

Parameters
reference_labelslist, len=n

Reference chord labels to score against.

estimated_labelslist, len=n

Estimated chord labels to score against.

Returns
comparison_scoresnp.ndarray, shape=(n,), dtype=float

Comparison scores, in [0.0, 1.0], or -1 if the comparison is out of gamut.

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> est_intervals, est_labels = mir_eval.util.adjust_intervals(
...     est_intervals, est_labels, ref_intervals.min(),
...     ref_intervals.max(), mir_eval.chord.NO_CHORD,
...     mir_eval.chord.NO_CHORD)
>>> (intervals,
...  ref_labels,
...  est_labels) = mir_eval.util.merge_labeled_intervals(
...      ref_intervals, ref_labels, est_intervals, est_labels)
>>> durations = mir_eval.util.intervals_to_durations(intervals)
>>> comparisons = mir_eval.chord.sevenths_inv(ref_labels, est_labels)
>>> score = mir_eval.chord.weighted_accuracy(comparisons, durations)
mir_eval.chord.directional_hamming_distance(reference_intervals, estimated_intervals)

Compute the directional hamming distance between reference and estimated intervals as defined by 5 and used for MIREX ‘OverSeg’, ‘UnderSeg’ and ‘MeanSeg’ measures.

Parameters
reference_intervalsnp.ndarray, shape=(n, 2), dtype=float

Reference chord intervals to score against.

estimated_intervalsnp.ndarray, shape=(m, 2), dtype=float

Estimated chord intervals to score against.

Returns
directional hamming distancefloat

directional hamming distance between reference intervals and estimated intervals.

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> overseg = 1 - mir_eval.chord.directional_hamming_distance(
...     ref_intervals, est_intervals)
>>> underseg = 1 - mir_eval.chord.directional_hamming_distance(
...     est_intervals, ref_intervals)
>>> seg = min(overseg, underseg)
mir_eval.chord.overseg(reference_intervals, estimated_intervals)

Compute the MIREX ‘OverSeg’ score.

Parameters
reference_intervalsnp.ndarray, shape=(n, 2), dtype=float

Reference chord intervals to score against.

estimated_intervalsnp.ndarray, shape=(m, 2), dtype=float

Estimated chord intervals to score against.

Returns
oversegmentation scorefloat

Comparison score, in [0.0, 1.0], where 1.0 means no oversegmentation.

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> score = mir_eval.chord.overseg(ref_intervals, est_intervals)
mir_eval.chord.underseg(reference_intervals, estimated_intervals)

Compute the MIREX ‘UnderSeg’ score.

Parameters
reference_intervalsnp.ndarray, shape=(n, 2), dtype=float

Reference chord intervals to score against.

estimated_intervalsnp.ndarray, shape=(m, 2), dtype=float

Estimated chord intervals to score against.

Returns
undersegmentation scorefloat

Comparison score, in [0.0, 1.0], where 1.0 means no undersegmentation.

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> score = mir_eval.chord.underseg(ref_intervals, est_intervals)
mir_eval.chord.seg(reference_intervals, estimated_intervals)

Compute the MIREX ‘MeanSeg’ score.

Parameters
reference_intervalsnp.ndarray, shape=(n, 2), dtype=float

Reference chord intervals to score against.

estimated_intervalsnp.ndarray, shape=(m, 2), dtype=float

Estimated chord intervals to score against.

Returns
segmentation scorefloat

Comparison score, in [0.0, 1.0], where 1.0 means perfect segmentation.

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> score = mir_eval.chord.seg(ref_intervals, est_intervals)
mir_eval.chord.merge_chord_intervals(intervals, labels)

Merge consecutive chord intervals if they represent the same chord.

Parameters
intervalsnp.ndarray, shape=(n, 2), dtype=float

Chord intervals to be merged, in the format returned by mir_eval.io.load_labeled_intervals().

labelslist, shape=(n,)

Chord labels to be merged, in the format returned by mir_eval.io.load_labeled_intervals().

Returns
merged_ivsnp.ndarray, shape=(k, 2), dtype=float

Merged chord intervals, k <= n

mir_eval.chord.evaluate(ref_intervals, ref_labels, est_intervals, est_labels, **kwargs)

Computes weighted accuracy for all comparison functions for the given reference and estimated annotations.

Parameters
ref_intervalsnp.ndarray, shape=(n, 2)

Reference chord intervals, in the format returned by mir_eval.io.load_labeled_intervals().

ref_labelslist, shape=(n,)

reference chord labels, in the format returned by mir_eval.io.load_labeled_intervals().

est_intervalsnp.ndarray, shape=(m, 2)

estimated chord intervals, in the format returned by mir_eval.io.load_labeled_intervals().

est_labelslist, shape=(m,)

estimated chord labels, in the format returned by mir_eval.io.load_labeled_intervals().

kwargs

Additional keyword arguments which will be passed to the appropriate metric or preprocessing functions.

Returns
scoresdict

Dictionary of scores, where the key is the metric name (str) and the value is the (float) score achieved.

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> scores = mir_eval.chord.evaluate(ref_intervals, ref_labels,
...                                  est_intervals, est_labels)

mir_eval.melody

Melody extraction algorithms aim to produce a sequence of frequency values corresponding to the pitch of the dominant melody from a musical recording. For evaluation, an estimated pitch series is evaluated against a reference based on whether the voicing (melody present or not) and the pitch are correct (within some tolerance).

For a detailed explanation of the measures please refer to:

J. Salamon, E. Gomez, D. P. W. Ellis and G. Richard, “Melody Extraction from Polyphonic Music Signals: Approaches, Applications and Challenges”, IEEE Signal Processing Magazine, 31(2):118-134, Mar. 2014.

and:

G. E. Poliner, D. P. W. Ellis, A. F. Ehmann, E. Gomez, S. Streich, and B. Ong. “Melody transcription from music audio: Approaches and evaluation”, IEEE Transactions on Audio, Speech, and Language Processing, 15(4):1247-1256, 2007.

For an explanation of the generalized measures (using non-binary voicings), please refer to:

R. Bittner and J. Bosch, “Generalized Metrics for Single-F0 Estimation Evaluation”, International Society for Music Information Retrieval Conference (ISMIR), 2019.

Conventions

Melody annotations are assumed to be given in the format of a 1d array of frequency values which are accompanied by a 1d array of times denoting when each frequency value occurs. In a reference melody time series, a frequency value of 0 denotes “unvoiced”. In an estimated melody time series, unvoiced frames can be indicated either by 0 Hz or by a negative Hz value - negative values represent the algorithm’s pitch estimate for frames it has determined as unvoiced, in case they are in fact voiced.

Metrics are computed using a sequence of reference and estimated pitches in cents and voicing arrays, both of which are sampled to the same timebase. The function mir_eval.melody.to_cent_voicing() can be used to convert a sequence of estimated and reference times and frequency values in Hz to voicing arrays and frequency arrays in the format required by the metric functions. By default, the convention is to resample the estimated melody time series to the reference melody time series’ timebase.
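
For example, a typical preprocessing-plus-metric call might look like this (file names are placeholders):

import mir_eval

ref_time, ref_freq = mir_eval.io.load_time_series('reference.txt')
est_time, est_freq = mir_eval.io.load_time_series('estimated.txt')
# Resample both series to the reference timebase and convert Hz to cents
(ref_v, ref_c,
 est_v, est_c) = mir_eval.melody.to_cent_voicing(ref_time, ref_freq,
                                                 est_time, est_freq)
rpa = mir_eval.melody.raw_pitch_accuracy(ref_v, ref_c, est_v, est_c)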

Metrics

  • mir_eval.melody.voicing_measures(): Voicing measures, including the recall rate (proportion of frames labeled as melody frames in the reference that are estimated as melody frames) and the false alarm rate (proportion of frames labeled as non-melody in the reference that are mistakenly estimated as melody frames)

  • mir_eval.melody.raw_pitch_accuracy(): Raw Pitch Accuracy, which computes the proportion of melody frames in the reference for which the frequency is considered correct (i.e. within half a semitone of the reference frequency)

  • mir_eval.melody.raw_chroma_accuracy(): Raw Chroma Accuracy, where the estimated and reference frequency sequences are mapped onto a single octave before computing the raw pitch accuracy

  • mir_eval.melody.overall_accuracy(): Overall Accuracy, which computes the proportion of all frames correctly estimated by the algorithm, including whether non-melody frames were labeled by the algorithm as non-melody

mir_eval.melody.validate_voicing(ref_voicing, est_voicing)

Checks that voicing inputs to a metric are in the correct format.

Parameters
ref_voicingnp.ndarray

Reference voicing array

est_voicingnp.ndarray

Estimated voicing array

mir_eval.melody.validate(ref_voicing, ref_cent, est_voicing, est_cent)

Checks that voicing and frequency arrays are well-formed. To be used in conjunction with mir_eval.melody.validate_voicing()

Parameters
ref_voicingnp.ndarray

Reference voicing array

ref_centnp.ndarray

Reference pitch sequence in cents

est_voicingnp.ndarray

Estimated voicing array

est_centnp.ndarray

Estimated pitch sequence in cents

mir_eval.melody.hz2cents(freq_hz, base_frequency=10.0)

Convert an array of frequency values in Hz to cents. 0 values are left in place.

Parameters
freq_hznp.ndarray

Array of frequencies in Hz.

base_frequencyfloat

Base frequency for conversion. (Default value = 10.0)

Returns
freq_centnp.ndarray

Array of frequencies in cents, relative to base_frequency
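
The conversion uses the standard cents formula, 1200 * log2(f / base_frequency), leaving zeros untouched; a small sketch (not the library code):

import numpy as np

base_frequency = 10.0
freq_hz = np.array([0.0, 220.0, 440.0])  # 0 Hz marks an unvoiced frame
voiced = freq_hz > 0
freq_cent = np.zeros_like(freq_hz)
freq_cent[voiced] = 1200.0 * np.log2(freq_hz[voiced] / base_frequency)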

mir_eval.melody.freq_to_voicing(frequencies, voicing=None)

Convert from an array of frequency values to a frequency array + voiced/unvoiced array

Parameters
frequenciesnp.ndarray

Array of frequencies. A frequency <= 0 indicates “unvoiced”.

voicingnp.ndarray

Array of voicing values (Default value = None). If None, the voicing is inferred from frequencies:

frames with frequency <= 0.0 are considered “unvoiced”; frames with frequency > 0.0 are considered “voiced”.

If specified, voicing is used as the voicing array, but frequencies with value 0 are forced to have 0 voicing.

Voicing inferred by negative frequency values is ignored.

Returns
frequenciesnp.ndarray

Array of frequencies, all >= 0.

voicednp.ndarray

Array of voicings between 0 and 1, same length as frequencies, which indicates voiced or unvoiced

mir_eval.melody.constant_hop_timebase(hop, end_time)

Generates a time series from 0 to end_time with times spaced hop apart

Parameters
hopfloat

Spacing of samples in the time series

end_timefloat

Time series will span [0, end_time]

Returns
timesnp.ndarray

Generated timebase

mir_eval.melody.resample_melody_series(times, frequencies, voicing, times_new, kind='linear')

Resamples frequency and voicing time series to a new timescale. Maintains any zero (“unvoiced”) values in frequencies.

If times and times_new are equivalent, no resampling will be performed.

Parameters
timesnp.ndarray

Times of each frequency value

frequenciesnp.ndarray

Array of frequency values, >= 0

voicingnp.ndarray

Array which indicates voiced or unvoiced. This array may be binary or have continuous values between 0 and 1.

times_newnp.ndarray

Times to resample frequency and voicing sequences to

kindstr

kind parameter to pass to scipy.interpolate.interp1d. (Default value = ‘linear’)

Returns
frequencies_resamplednp.ndarray

Frequency array resampled to new timebase

voicing_resamplednp.ndarray

Voicing array resampled to new timebase

mir_eval.melody.to_cent_voicing(ref_time, ref_freq, est_time, est_freq, est_voicing=None, ref_reward=None, base_frequency=10.0, hop=None, kind='linear')

Converts reference and estimated time/frequency (Hz) annotations to sampled frequency (cent)/voicing arrays.

A zero frequency indicates “unvoiced”.

If est_voicing is not provided, a negative frequency indicates: “predicted as unvoiced, but if it is voiced, this is the frequency estimate”. If est_voicing is provided, negative frequency values are ignored, and the voicing from est_voicing is used directly.

Parameters
ref_timenp.ndarray

Time of each reference frequency value

ref_freqnp.ndarray

Array of reference frequency values

est_timenp.ndarray

Time of each estimated frequency value

est_freqnp.ndarray

Array of estimated frequency values

est_voicingnp.ndarray

Estimated voicing confidence. Default None, which means the voicing is inferred from est_freq:

frames with frequency <= 0.0 are considered “unvoiced”; frames with frequency > 0.0 are considered “voiced”

ref_rewardnp.ndarray

Reference voicing reward. Default None, which means all frames are weighted equally.

base_frequencyfloat

Base frequency in Hz for conversion to cents (Default value = 10.)

hopfloat

Hop size, in seconds, to use for resampling. Default None, which means the reference timebase ref_time is used.

kindstr

kind parameter to pass to scipy.interpolate.interp1d. (Default value = ‘linear’)

Returns
ref_voicingnp.ndarray

Resampled reference voicing array

ref_centnp.ndarray

Resampled reference frequency (cent) array

est_voicingnp.ndarray

Resampled estimated voicing array

est_centnp.ndarray

Resampled estimated frequency (cent) array
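
Examples

Typical usage, mirroring the examples given for the metric functions below:

>>> ref_time, ref_freq = mir_eval.io.load_time_series('ref.txt')
>>> est_time, est_freq = mir_eval.io.load_time_series('est.txt')
>>> (ref_v, ref_c,
...  est_v, est_c) = mir_eval.melody.to_cent_voicing(ref_time,
...                                                  ref_freq,
...                                                  est_time,
...                                                  est_freq)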

mir_eval.melody.voicing_recall(ref_voicing, est_voicing)

Compute the voicing recall given two voicing indicator sequences, one as reference (truth) and the other as the estimate (prediction). The sequences must be of the same length.

Parameters
ref_voicingnp.ndarray

Reference boolean voicing array

est_voicingnp.ndarray

Estimated boolean voicing array

Returns
vx_recallfloat

Voicing recall rate, the fraction of voiced frames in ref indicated as voiced in est

Examples

>>> ref_time, ref_freq = mir_eval.io.load_time_series('ref.txt')
>>> est_time, est_freq = mir_eval.io.load_time_series('est.txt')
>>> (ref_v, ref_c,
...  est_v, est_c) = mir_eval.melody.to_cent_voicing(ref_time,
...                                                  ref_freq,
...                                                  est_time,
...                                                  est_freq)
>>> recall = mir_eval.melody.voicing_recall(ref_v, est_v)

mir_eval.melody.voicing_false_alarm(ref_voicing, est_voicing)

Compute the voicing false alarm rate given two voicing indicator sequences, one as reference (truth) and the other as the estimate (prediction). The sequences must be of the same length.

Parameters
ref_voicingnp.ndarray

Reference boolean voicing array

est_voicingnp.ndarray

Estimated boolean voicing array

Returns
vx_false_alarmfloat

Voicing false alarm rate, the fraction of unvoiced frames in ref indicated as voiced in est

Examples

>>> ref_time, ref_freq = mir_eval.io.load_time_series('ref.txt')
>>> est_time, est_freq = mir_eval.io.load_time_series('est.txt')
>>> (ref_v, ref_c,
...  est_v, est_c) = mir_eval.melody.to_cent_voicing(ref_time,
...                                                  ref_freq,
...                                                  est_time,
...                                                  est_freq)
>>> false_alarm = mir_eval.melody.voicing_false_alarm(ref_v, est_v)

mir_eval.melody.voicing_measures(ref_voicing, est_voicing)

Compute the voicing recall and false alarm rates given two voicing indicator sequences, one as reference (truth) and the other as the estimate (prediction). The sequences must be of the same length.

Parameters
ref_voicingnp.ndarray

Reference boolean voicing array

est_voicingnp.ndarray

Estimated boolean voicing array

Returns
vx_recallfloat

Voicing recall rate, the fraction of voiced frames in ref indicated as voiced in est

vx_false_alarmfloat

Voicing false alarm rate, the fraction of unvoiced frames in ref indicated as voiced in est

Examples

>>> ref_time, ref_freq = mir_eval.io.load_time_series('ref.txt')
>>> est_time, est_freq = mir_eval.io.load_time_series('est.txt')
>>> (ref_v, ref_c,
...  est_v, est_c) = mir_eval.melody.to_cent_voicing(ref_time,
...                                                  ref_freq,
...                                                  est_time,
...                                                  est_freq)
>>> recall, false_alarm = mir_eval.melody.voicing_measures(ref_v, est_v)

mir_eval.melody.raw_pitch_accuracy(ref_voicing, ref_cent, est_voicing, est_cent, cent_tolerance=50)

Compute the raw pitch accuracy given two pitch (frequency) sequences in cents and matching voicing indicator sequences. The first pitch and voicing arrays are treated as the reference (truth), and the second two as the estimate (prediction). All 4 sequences must be of the same length.

Parameters
ref_voicingnp.ndarray

Reference voicing array. When this array is non-binary, it is treated as a ‘reference reward’, as in (Bittner & Bosch, 2019)

ref_centnp.ndarray

Reference pitch sequence in cents

est_voicingnp.ndarray

Estimated voicing array

est_centnp.ndarray

Estimated pitch sequence in cents

cent_tolerancefloat

Maximum absolute deviation in cents for a frequency value to be considered correct (Default value = 50)

Returns
raw_pitchfloat

Raw pitch accuracy, the fraction of voiced frames in ref_cent for which est_cent provides a correct frequency value (within cent_tolerance cents).

Examples

>>> ref_time, ref_freq = mir_eval.io.load_time_series('ref.txt')
>>> est_time, est_freq = mir_eval.io.load_time_series('est.txt')
>>> (ref_v, ref_c,
...  est_v, est_c) = mir_eval.melody.to_cent_voicing(ref_time,
...                                                  ref_freq,
...                                                  est_time,
...                                                  est_freq)
>>> raw_pitch = mir_eval.melody.raw_pitch_accuracy(ref_v, ref_c,
...                                                est_v, est_c)
mir_eval.melody.raw_chroma_accuracy(ref_voicing, ref_cent, est_voicing, est_cent, cent_tolerance=50)

Compute the raw chroma accuracy given two pitch (frequency) sequences in cents and matching voicing indicator sequences. The first pitch and voicing arrays are treated as the reference (truth), and the second two as the estimate (prediction). All 4 sequences must be of the same length.

Parameters
ref_voicingnp.ndarray

Reference voicing array. When this array is non-binary, it is treated as a ‘reference reward’, as in (Bittner & Bosch, 2019)

ref_centnp.ndarray

Reference pitch sequence in cents

est_voicingnp.ndarray

Estimated voicing array

est_centnp.ndarray

Estimated pitch sequence in cents

cent_tolerancefloat

Maximum absolute deviation in cents for a frequency value to be considered correct (Default value = 50)

Returns
raw_chromafloat

Raw chroma accuracy, the fraction of voiced frames in ref_cent for which est_cent provides a correct frequency value (within cent_tolerance cents), ignoring octave errors

Examples

>>> ref_time, ref_freq = mir_eval.io.load_time_series('ref.txt')
>>> est_time, est_freq = mir_eval.io.load_time_series('est.txt')
>>> (ref_v, ref_c,
...  est_v, est_c) = mir_eval.melody.to_cent_voicing(ref_time,
...                                                  ref_freq,
...                                                  est_time,
...                                                  est_freq)
>>> raw_chroma = mir_eval.melody.raw_chroma_accuracy(ref_v, ref_c,
...                                                  est_v, est_c)
mir_eval.melody.overall_accuracy(ref_voicing, ref_cent, est_voicing, est_cent, cent_tolerance=50)

Compute the overall accuracy given two pitch (frequency) sequences in cents and matching voicing indicator sequences. The first pitch and voicing arrays are treated as the reference (truth), and the second two as the estimate (prediction). All 4 sequences must be of the same length.

Parameters
ref_voicingnp.ndarray

Reference voicing array. When this array is non-binary, it is treated as a ‘reference reward’, as in (Bittner & Bosch, 2019)

ref_centnp.ndarray

Reference pitch sequence in cents

est_voicingnp.ndarray

Estimated voicing array

est_centnp.ndarray

Estimated pitch sequence in cents

cent_tolerancefloat

Maximum absolute deviation in cents for a frequency value to be considered correct (Default value = 50)

Returns
overall_accuracyfloat

Overall accuracy, the total fraction of frames estimated correctly, counting both voiced frames for which est_cent provides a correct frequency value (within cent_tolerance cents) and frames correctly labeled as unvoiced.

Examples

>>> ref_time, ref_freq = mir_eval.io.load_time_series('ref.txt')
>>> est_time, est_freq = mir_eval.io.load_time_series('est.txt')
>>> (ref_v, ref_c,
...  est_v, est_c) = mir_eval.melody.to_cent_voicing(ref_time,
...                                                  ref_freq,
...                                                  est_time,
...                                                  est_freq)
>>> overall_accuracy = mir_eval.melody.overall_accuracy(ref_v, ref_c,
...                                                     est_v, est_c)
mir_eval.melody.evaluate(ref_time, ref_freq, est_time, est_freq, est_voicing=None, ref_reward=None, **kwargs)

Evaluate two melody (predominant f0) transcriptions, where the first is treated as the reference (ground truth) and the second as the estimate to be evaluated (prediction).

Parameters
ref_timenp.ndarray

Time of each reference frequency value

ref_freqnp.ndarray

Array of reference frequency values

est_timenp.ndarray

Time of each estimated frequency value

est_freqnp.ndarray

Array of estimated frequency values

est_voicingnp.ndarray

Estimated voicing confidence. Default None, which means the voicing is inferred from est_freq:

frames with frequency <= 0.0 are considered “unvoiced”; frames with frequency > 0.0 are considered “voiced”

ref_rewardnp.ndarray

Reference pitch estimation reward. Default None, which means all frames are weighted equally.

kwargs

Additional keyword arguments which will be passed to the appropriate metric or preprocessing functions.

Returns
scoresdict

Dictionary of scores, where the key is the metric name (str) and the value is the (float) score achieved.

References

6

J. Salamon, E. Gomez, D. P. W. Ellis and G. Richard, “Melody Extraction from Polyphonic Music Signals: Approaches, Applications and Challenges”, IEEE Signal Processing Magazine, 31(2):118-134, Mar. 2014.

7

G. E. Poliner, D. P. W. Ellis, A. F. Ehmann, E. Gomez, S. Streich, and B. Ong. “Melody transcription from music audio: Approaches and evaluation”, IEEE Transactions on Audio, Speech, and Language Processing, 15(4):1247-1256, 2007.

8

R. Bittner and J. Bosch, “Generalized Metrics for Single-F0 Estimation Evaluation”, International Society for Music Information Retrieval Conference (ISMIR), 2019.

Examples

>>> ref_time, ref_freq = mir_eval.io.load_time_series('ref.txt')
>>> est_time, est_freq = mir_eval.io.load_time_series('est.txt')
>>> scores = mir_eval.melody.evaluate(ref_time, ref_freq,
...                                   est_time, est_freq)

mir_eval.multipitch

The goal of multiple f0 (multipitch) estimation and tracking is to identify all of the active fundamental frequencies in each time frame in a complex music signal.

Conventions

Multipitch estimates are represented by a timebase and a corresponding list of arrays of frequency estimates. Frequency estimates may have any number of frequency values, including 0 (represented by an empty array). Time values are in units of seconds and frequency estimates are in units of Hz.

The timebase of the estimate time series should ideally match the timebase of the reference time series, but if this is not the case, the estimate time series is resampled using nearest-neighbor interpolation to match the reference. Time values in the estimate time series that are outside of the range of the reference time series are given null (empty array) frequencies.

By default, a frequency is “correct” if it is within 0.5 semitones of a reference frequency. Frequency values are compared by first mapping them to log-2 semitone space, where the distance between semitones is constant. Chroma-wrapped frequency values are computed by taking the log-2 frequency values modulo 12 to map them down to a single octave. A chroma-wrapped frequency estimate is correct if its single-octave value is within 0.5 semitones of the single-octave reference frequency.

The metrics are based on those described in 9 and 10.

Metrics

  • mir_eval.multipitch.metrics(): Precision, Recall, Accuracy, Substitution, Miss, False Alarm, and Total Error scores based both on raw frequency values and values mapped to a single octave (chroma).

References

9

G. E. Poliner and D. P. W. Ellis, “A Discriminative Model for Polyphonic Piano Transcription”, EURASIP Journal on Advances in Signal Processing, 2007(1):154-163, Jan. 2007.

10

Bay, M., Ehmann, A. F., & Downie, J. S. (2009). Evaluation of Multiple-F0 Estimation and Tracking Systems. In ISMIR (pp. 315-320).

mir_eval.multipitch.validate(ref_time, ref_freqs, est_time, est_freqs)

Checks that the time and frequency inputs are well-formed.

Parameters
ref_timenp.ndarray

reference time stamps in seconds

ref_freqslist of np.ndarray

reference frequencies in Hz

est_timenp.ndarray

estimate time stamps in seconds

est_freqslist of np.ndarray

estimated frequencies in Hz

mir_eval.multipitch.resample_multipitch(times, frequencies, target_times)

Resamples multipitch time series to a new timescale. Values in target_times outside the range of times return no pitch estimate.

Parameters
timesnp.ndarray

Array of time stamps

frequencieslist of np.ndarray

List of np.ndarrays of frequency values

target_timesnp.ndarray

Array of target time stamps

Returns
frequencies_resampledlist of numpy arrays

Frequency list of lists resampled to new timebase
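
Examples

A minimal illustrative sketch (all values are hypothetical):

>>> import numpy as np
>>> times = np.array([0.00, 0.01, 0.02])
>>> frequencies = [np.array([220.0, 330.0]), np.array([220.0]), np.array([])]
>>> target_times = np.array([0.000, 0.005, 0.010, 0.015, 0.020])
>>> frequencies_resampled = mir_eval.multipitch.resample_multipitch(
...     times, frequencies, target_times)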

mir_eval.multipitch.frequencies_to_midi(frequencies, ref_frequency=440.0)

Converts frequencies to continuous MIDI values.

Parameters
frequencieslist of np.ndarray

Original frequency values

ref_frequencyfloat

reference frequency in Hz. (Default value = 440.0)

Returns
frequencies_midilist of np.ndarray

Continuous MIDI frequency values.
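
Examples

A minimal illustrative sketch (the input values are hypothetical):

>>> import numpy as np
>>> frequencies = [np.array([440.0, 220.0]), np.array([])]
>>> frequencies_midi = mir_eval.multipitch.frequencies_to_midi(frequencies)
>>> # Under the usual MIDI convention (440 Hz = note 69), 440.0 maps to 69.0
>>> # and 220.0 maps to 57.0; empty frames stay empty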

mir_eval.multipitch.midi_to_chroma(frequencies_midi)

Wrap MIDI frequencies to a single octave (chroma).

Parameters
frequencies_midilist of np.ndarray

Continuous MIDI note frequency values.

Returns
frequencies_chromalist of np.ndarray

Midi values wrapped to one octave.
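
Examples

A minimal illustrative sketch (the input values are hypothetical):

>>> import numpy as np
>>> frequencies_midi = [np.array([69.0, 81.0]), np.array([60.5])]
>>> frequencies_chroma = mir_eval.multipitch.midi_to_chroma(frequencies_midi)
>>> # 69.0 and 81.0 are the same pitch class one octave apart, so they wrap
>>> # to the same chroma value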

mir_eval.multipitch.compute_num_freqs(frequencies)

Computes the number of frequencies for each time point.

Parameters
frequencieslist of np.ndarray

Frequency values

Returns
num_freqsnp.ndarray

Number of frequencies at each time point.

mir_eval.multipitch.compute_num_true_positives(ref_freqs, est_freqs, window=0.5, chroma=False)

Compute the number of true positives in an estimate given a reference. A frequency is counted as correct if it is within window semitones (by default 0.5, i.e. a quartertone) of a reference frequency.

Parameters
ref_freqslist of np.ndarray

reference frequencies (MIDI)

est_freqslist of np.ndarray

estimated frequencies (MIDI)

windowfloat

Window size, in semitones

chromabool

If True, computes distances modulo one octave; in this case ref_freqs and est_freqs should already be wrapped to a single octave.

Returns
true_positivesnp.ndarray

Array the same length as ref_freqs containing the number of true positives.
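
Examples

A minimal illustrative sketch (the MIDI values are hypothetical):

>>> import numpy as np
>>> ref_freqs = [np.array([69.0, 60.0]), np.array([69.0])]
>>> est_freqs = [np.array([69.3]), np.array([69.0, 50.0])]
>>> true_positives = mir_eval.multipitch.compute_num_true_positives(
...     ref_freqs, est_freqs)
>>> # With the default 0.5-semitone window, exactly one estimate per frame
>>> # matches a reference here, giving one true positive in each frame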

mir_eval.multipitch.compute_accuracy(true_positives, n_ref, n_est)

Compute accuracy metrics.

Parameters
true_positivesnp.ndarray

Array containing the number of true positives at each time point.

n_refnp.ndarray

Array containing the number of reference frequencies at each time point.

n_estnp.ndarray

Array containing the number of estimate frequencies at each time point.

Returns
precisionfloat

sum(true_positives)/sum(n_est)

recallfloat

sum(true_positives)/sum(n_ref)

accfloat

sum(true_positives)/sum(n_est + n_ref - true_positives)
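
Examples

A minimal illustrative sketch using hypothetical per-frame counts:

>>> import numpy as np
>>> true_positives = np.array([1, 1])
>>> n_ref = np.array([2, 1])
>>> n_est = np.array([1, 2])
>>> precision, recall, acc = mir_eval.multipitch.compute_accuracy(
...     true_positives, n_ref, n_est)
>>> # precision = 2/3, recall = 2/3, acc = 2/(3 + 3 - 2) = 0.5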

mir_eval.multipitch.compute_err_score(true_positives, n_ref, n_est)

Compute error score metrics.

Parameters
true_positivesnp.ndarray

Array containing the number of true positives at each time point.

n_refnp.ndarray

Array containing the number of reference frequencies at each time point.

n_estnp.ndarray

Array containing the number of estimate frequencies at each time point.

Returns
e_subfloat

Substitution error

e_missfloat

Miss error

e_fafloat

False alarm error

e_totfloat

Total error
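
Examples

A minimal illustrative sketch using the same hypothetical per-frame counts as in the compute_accuracy() example above:

>>> import numpy as np
>>> true_positives = np.array([1, 1])
>>> n_ref = np.array([2, 1])
>>> n_est = np.array([1, 2])
>>> e_sub, e_miss, e_fa, e_tot = mir_eval.multipitch.compute_err_score(
...     true_positives, n_ref, n_est)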

mir_eval.multipitch.metrics(ref_time, ref_freqs, est_time, est_freqs, **kwargs)

Compute multipitch metrics. All metrics are computed at the ‘macro’ level such that the frame true positive/false positive/false negative rates are summed across time and the metrics are computed on the combined values.

Parameters
ref_timenp.ndarray

Time of each reference frequency value

ref_freqslist of np.ndarray

List of np.ndarrays of reference frequency values

est_timenp.ndarray

Time of each estimated frequency value

est_freqslist of np.ndarray

List of np.ndarrays of estimate frequency values

kwargs

Additional keyword arguments which will be passed to the appropriate metric or preprocessing functions.

Returns
precisionfloat

Precision (TP/(TP + FP))

recallfloat

Recall (TP/(TP + FN))

accuracyfloat

Accuracy (TP/(TP + FP + FN))

e_subfloat

Substitution error

e_missfloat

Miss error

e_fafloat

False alarm error

e_totfloat

Total error

precision_chromafloat

Chroma precision

recall_chromafloat

Chroma recall

accuracy_chromafloat

Chroma accuracy

e_sub_chromafloat

Chroma substitution error

e_miss_chromafloat

Chroma miss error

e_fa_chromafloat

Chroma false alarm error

e_tot_chromafloat

Chroma total error

Examples

>>> ref_time, ref_freqs = mir_eval.io.load_ragged_time_series(
...     'reference.txt')
>>> est_time, est_freqs = mir_eval.io.load_ragged_time_series(
...     'estimated.txt')
>>> metrics_tuple = mir_eval.multipitch.metrics(
...     ref_time, ref_freqs, est_time, est_freqs)
mir_eval.multipitch.evaluate(ref_time, ref_freqs, est_time, est_freqs, **kwargs)

Evaluate two multipitch (multi-f0) transcriptions, where the first is treated as the reference (ground truth) and the second as the estimate to be evaluated (prediction).

Parameters
ref_timenp.ndarray

Time of each reference frequency value

ref_freqslist of np.ndarray

List of np.ndarrays of reference frequency values

est_timenp.ndarray

Time of each estimated frequency value

est_freqslist of np.ndarray

List of np.ndarrays of estimate frequency values

kwargs

Additional keyword arguments which will be passed to the appropriate metric or preprocessing functions.

Returns
scoresdict

Dictionary of scores, where the key is the metric name (str) and the value is the (float) score achieved.

Examples

>>> ref_time, ref_freq = mir_eval.io.load_ragged_time_series('ref.txt')
>>> est_time, est_freq = mir_eval.io.load_ragged_time_series('est.txt')
>>> scores = mir_eval.multipitch.evaluate(ref_time, ref_freq,
...                                       est_time, est_freq)

mir_eval.onset

The goal of an onset detection algorithm is to automatically determine when notes are played in a piece of music. The primary method used to evaluate onset detectors is to first determine which estimated onsets are “correct”, where correctness is defined as being within a small window of a reference onset.

Based in part on this script:

Conventions

Onsets should be provided in the form of a 1-dimensional array of onset times in seconds in increasing order.

Metrics

  • mir_eval.onset.f_measure(): Precision, Recall, and F-measure scores based on the number of estimated onsets which are sufficiently close to reference onsets.

mir_eval.onset.validate(reference_onsets, estimated_onsets)

Checks that the input annotations to a metric look like valid onset time arrays, and throws helpful errors if not.

Parameters
reference_onsetsnp.ndarray

reference onset locations, in seconds

estimated_onsetsnp.ndarray

estimated onset locations, in seconds

mir_eval.onset.f_measure(reference_onsets, estimated_onsets, window=0.05)

Compute the F-measure of correct vs incorrectly predicted onsets. “Correctness” is determined over a small window.

Parameters
reference_onsetsnp.ndarray

reference onset locations, in seconds

estimated_onsetsnp.ndarray

estimated onset locations, in seconds

windowfloat

Window size, in seconds (Default value = .05)

Returns
f_measurefloat

2*precision*recall/(precision + recall)

precisionfloat

(# true positives)/(# true positives + # false positives)

recallfloat

(# true positives)/(# true positives + # false negatives)

Examples

>>> reference_onsets = mir_eval.io.load_events('reference.txt')
>>> estimated_onsets = mir_eval.io.load_events('estimated.txt')
>>> F, P, R = mir_eval.onset.f_measure(reference_onsets,
...                                    estimated_onsets)
mir_eval.onset.evaluate(reference_onsets, estimated_onsets, **kwargs)

Compute all metrics for the given reference and estimated annotations.

Parameters
reference_onsetsnp.ndarray

reference onset locations, in seconds

estimated_onsetsnp.ndarray

estimated onset locations, in seconds

kwargs

Additional keyword arguments which will be passed to the appropriate metric or preprocessing functions.

Returns
scoresdict

Dictionary of scores, where the key is the metric name (str) and the value is the (float) score achieved.

Examples

>>> reference_onsets = mir_eval.io.load_events('reference.txt')
>>> estimated_onsets = mir_eval.io.load_events('estimated.txt')
>>> scores = mir_eval.onset.evaluate(reference_onsets,
...                                  estimated_onsets)

mir_eval.pattern

Pattern discovery involves the identification of musical patterns (i.e. short fragments or melodic ideas that repeat at least twice) both from audio and symbolic representations. The metrics used to evaluate pattern discovery systems attempt to quantify the ability of the algorithm to not only determine the present patterns in a piece, but also to find all of their occurrences.

Based on the methods described here:

T. Collins. MIREX task: Discovery of repeated themes & sections. http://www.music-ir.org/mirex/wiki/2013:Discovery_of_Repeated_Themes_&_Sections, 2013.

Conventions

The input format can be automatically generated by calling mir_eval.io.load_patterns(). This format is a list of lists of tuples. The first list collects patterns, each of which is a list of occurrences, and each occurrence is a list of MIDI onset tuples of (onset_time, midi_note)

A pattern is a list of occurrences. The first occurrence must be the prototype of that pattern (i.e. the most representative of all the occurrences). An occurrence is a list of tuples containing the onset time and the midi note number.
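
For example, a single pattern with two occurrences of three notes each might look like the following (the onset times and MIDI note numbers are hypothetical):

>>> pattern = [[(77.0, 67), (77.5, 69), (78.0, 71)],   # prototype occurrence
...            [(94.0, 67), (94.5, 69), (95.0, 71)]]   # a later repetition
>>> reference_patterns = [pattern]   # an annotation is a list of such patterns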

Metrics

  • mir_eval.pattern.standard_FPR(): Strict metric in order to find the possibly transposed patterns of exact length. This is the only metric that considers transposed patterns.

  • mir_eval.pattern.establishment_FPR(): Evaluates the number of patterns that were successfully identified by the estimated results, no matter how many of their occurrences were found. In other words, this metric captures whether the algorithm successfully established that a pattern repeats at least twice and that this pattern is also found in the reference annotation.

  • mir_eval.pattern.occurrence_FPR(): Evaluation of how well an estimation can effectively identify all the occurrences of the found patterns, independently of how many patterns have been discovered. This metric has a threshold parameter that indicates how similar two occurrences must be in order to be considered equal. In MIREX, this evaluation is run twice, with thresholds .75 and .5.

  • mir_eval.pattern.three_layer_FPR(): Aims to evaluate the general similarity between the reference and the estimations, combining both the establishment of patterns and the retrieval of their occurrences in a single F1 score.

  • mir_eval.pattern.first_n_three_layer_P(): Computes the three-layer precision for the first N patterns only in order to measure the ability of the algorithm to sort the identified patterns based on their relevance.

  • mir_eval.pattern.first_n_target_proportion_R(): Computes the target proportion recall for the first N patterns only in order to measure the ability of the algorithm to sort the identified patterns based on their relevance.

mir_eval.pattern.validate(reference_patterns, estimated_patterns)

Checks that the input annotations to a metric look like valid pattern lists, and throws helpful errors if not.

Parameters
reference_patternslist

The reference patterns using the format returned by mir_eval.io.load_patterns()

estimated_patternslist

The estimated patterns in the same format

mir_eval.pattern.standard_FPR(reference_patterns, estimated_patterns, tol=1e-05)

Standard F1 Score, Precision and Recall.

This metric checks if the prototype patterns of the reference match possible translated patterns in the prototype patterns of the estimations. Since the sizes of these prototypes must be equal, this metric is quite restrictive and it tends to be 0 in most of the 2013 MIREX results.

Parameters
reference_patternslist

The reference patterns using the format returned by mir_eval.io.load_patterns()

estimated_patternslist

The estimated patterns in the same format

tolfloat

Tolerance level when comparing reference against estimation. Default parameter is the one found in the original matlab code by Tom Collins used for MIREX 2013. (Default value = 1e-5)

Returns
f_measurefloat

The standard F1 Score

precisionfloat

The standard Precision

recallfloat

The standard Recall

Examples

>>> ref_patterns = mir_eval.io.load_patterns("ref_pattern.txt")
>>> est_patterns = mir_eval.io.load_patterns("est_pattern.txt")
>>> F, P, R = mir_eval.pattern.standard_FPR(ref_patterns, est_patterns)
mir_eval.pattern.establishment_FPR(reference_patterns, estimated_patterns, similarity_metric='cardinality_score')

Establishment F1 Score, Precision and Recall.

Parameters
reference_patternslist

The reference patterns in the format returned by mir_eval.io.load_patterns()

estimated_patternslist

The estimated patterns in the same format

similarity_metricstr

A string representing the metric to be used when computing the similarity matrix. Accepted values:

  • “cardinality_score”: Count of the intersection between occurrences.

(Default value = “cardinality_score”)

Returns
f_measurefloat

The establishment F1 Score

precisionfloat

The establishment Precision

recallfloat

The establishment Recall

Examples

>>> ref_patterns = mir_eval.io.load_patterns("ref_pattern.txt")
>>> est_patterns = mir_eval.io.load_patterns("est_pattern.txt")
>>> F, P, R = mir_eval.pattern.establishment_FPR(ref_patterns,
...                                              est_patterns)
mir_eval.pattern.occurrence_FPR(reference_patterns, estimated_patterns, thres=0.75, similarity_metric='cardinality_score')

Occurrence F1 Score, Precision and Recall.

Parameters
reference_patternslist

The reference patterns in the format returned by mir_eval.io.load_patterns()

estimated_patternslist

The estimated patterns in the same format

thresfloat

How similar two occurrences must be in order to be considered equal (Default value = .75)

similarity_metricstr

A string representing the metric to be used when computing the similarity matrix. Accepted values:

  • “cardinality_score”: Count of the intersection between occurrences.

(Default value = “cardinality_score”)

Returns
f_measurefloat

The occurrence F1 Score

precisionfloat

The occurrence Precision

recallfloat

The occurrence Recall

Examples

>>> ref_patterns = mir_eval.io.load_patterns("ref_pattern.txt")
>>> est_patterns = mir_eval.io.load_patterns("est_pattern.txt")
>>> F, P, R = mir_eval.pattern.occurrence_FPR(ref_patterns,
...                                           est_patterns)
mir_eval.pattern.three_layer_FPR(reference_patterns, estimated_patterns)

Three Layer F1 Score, Precision and Recall. As described by Meredith.

Parameters
reference_patternslist

The reference patterns in the format returned by mir_eval.io.load_patterns()

estimated_patternslist

The estimated patterns in the same format

Returns
f_measurefloat

The three-layer F1 Score

precisionfloat

The three-layer Precision

recallfloat

The three-layer Recall

Examples

>>> ref_patterns = mir_eval.io.load_patterns("ref_pattern.txt")
>>> est_patterns = mir_eval.io.load_patterns("est_pattern.txt")
>>> F, P, R = mir_eval.pattern.three_layer_FPR(ref_patterns,
...                                            est_patterns)
mir_eval.pattern.first_n_three_layer_P(reference_patterns, estimated_patterns, n=5)

First n three-layer precision.

This metric is basically the same as the three-layer FPR, but it is only applied to the first n estimated patterns and it only returns the precision. In MIREX (and typically in practice), n = 5.

Parameters
reference_patternslist

The reference patterns in the format returned by mir_eval.io.load_patterns()

estimated_patternslist

The estimated patterns in the same format

nint

Number of patterns to consider from the estimated results, in the order they appear in the matrix (Default value = 5)

Returns
precisionfloat

The first n three-layer Precision

Examples

>>> ref_patterns = mir_eval.io.load_patterns("ref_pattern.txt")
>>> est_patterns = mir_eval.io.load_patterns("est_pattern.txt")
>>> P = mir_eval.pattern.first_n_three_layer_P(ref_patterns,
...                                            est_patterns, n=5)
mir_eval.pattern.first_n_target_proportion_R(reference_patterns, estimated_patterns, n=5)

First n target proportion establishment recall metric.

This metric is similar to the establishment FPR score, but it only takes into account the first n estimated patterns and it only outputs the Recall value.

Parameters
reference_patternslist

The reference patterns in the format returned by mir_eval.io.load_patterns()

estimated_patternslist

The estimated patterns in the same format

nint

Number of patterns to consider from the estimated results, in the order they appear in the matrix. (Default value = 5)

Returns
recallfloat

The first n target proportion Recall.

Examples

>>> ref_patterns = mir_eval.io.load_patterns("ref_pattern.txt")
>>> est_patterns = mir_eval.io.load_patterns("est_pattern.txt")
>>> R = mir_eval.pattern.first_n_target_proportion_R(
...                                 ref_patterns, est_patterns, n=5)
mir_eval.pattern.evaluate(ref_patterns, est_patterns, **kwargs)

Compute all pattern discovery metrics for the given reference and estimated patterns.

Parameters
ref_patternslist

The reference patterns in the format returned by mir_eval.io.load_patterns()

est_patternslist

The estimated patterns in the same format

kwargs

Additional keyword arguments which will be passed to the appropriate metric or preprocessing functions.

Returns
scoresdict

Dictionary of scores, where the key is the metric name (str) and the value is the (float) score achieved.

Examples

>>> ref_patterns = mir_eval.io.load_patterns("ref_pattern.txt")
>>> est_patterns = mir_eval.io.load_patterns("est_pattern.txt")
>>> scores = mir_eval.pattern.evaluate(ref_patterns, est_patterns)

mir_eval.segment

Evaluation criteria for structural segmentation fall into two categories: boundary annotation and structural annotation. Boundary annotation is the task of predicting the times at which structural changes occur, such as when a verse transitions to a refrain. Metrics for boundary annotation compare estimated segment boundaries to reference boundaries. Structural annotation is the task of assigning labels to detected segments. The estimated labels may be arbitrary strings (such as A, B, or C) and need not describe functional concepts. Metrics for structural annotation are similar to those used for clustering data.

Conventions

Both boundary and structural annotation metrics require two-dimensional arrays with two columns, one for boundary start times and one for boundary end times. Structural annotation further requires lists of reference and estimated segment labels, whose length must equal the number of rows in the corresponding array of boundary edges. In both tasks, we assume that annotations express a partitioning of the track into intervals. The function mir_eval.util.adjust_intervals() can be used to pad or crop the segment boundaries to span the duration of the entire track.

Metrics

  • mir_eval.segment.detection(): An estimated boundary is considered correct if it falls within a window around a reference boundary 11

  • mir_eval.segment.deviation(): Computes the median absolute time difference from a reference boundary to its nearest estimated boundary, and vice versa 11

  • mir_eval.segment.pairwise(): For classifying pairs of sampled time instants as belonging to the same structural component 12

  • mir_eval.segment.rand_index(): Clusters reference and estimated annotations and compares them by the Rand Index

  • mir_eval.segment.ari(): Computes the Rand index, adjusted for chance

  • mir_eval.segment.nce(): Interprets sampled reference and estimated labels as samples of random variables Y_R, Y_E from which the conditional entropy of Y_R given Y_E (Under-Segmentation) and Y_E given Y_R (Over-Segmentation) are estimated 13

  • mir_eval.segment.mutual_information(): Computes the standard, normalized, and adjusted mutual information of sampled reference and estimated segments

  • mir_eval.segment.vmeasure(): Computes the V-Measure, which is similar to the conditional entropy metrics, but uses the marginal distributions as normalization rather than the maximum entropy distribution 14

References

11(1,2)

Turnbull, D., Lanckriet, G. R., Pampalk, E., & Goto, M. A Supervised Approach for Detecting Boundaries in Music Using Difference Features and Boosting. In ISMIR (pp. 51-54).

12

Levy, M., & Sandler, M. Structural segmentation of musical audio by constrained clustering. IEEE transactions on audio, speech, and language processing, 16(2), 318-326.

13

Lukashevich, H. M. Towards Quantitative Measures of Evaluating Song Segmentation. In ISMIR (pp. 375-380).

14

Rosenberg, A., & Hirschberg, J. V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure. In EMNLP-CoNLL (Vol. 7, pp. 410-420).

mir_eval.segment.validate_boundary(reference_intervals, estimated_intervals, trim)

Checks that the input annotations to a segment boundary estimation metric (i.e. one that only takes in segment intervals) look like valid segment times, and throws helpful errors if not.

Parameters
reference_intervalsnp.ndarray, shape=(n, 2)

reference segment intervals, in the format returned by mir_eval.io.load_intervals() or mir_eval.io.load_labeled_intervals().

estimated_intervalsnp.ndarray, shape=(m, 2)

estimated segment intervals, in the format returned by mir_eval.io.load_intervals() or mir_eval.io.load_labeled_intervals().

trimbool

will the start and end events be trimmed?

mir_eval.segment.validate_structure(reference_intervals, reference_labels, estimated_intervals, estimated_labels)

Checks that the input annotations to a structure estimation metric (i.e. one that takes in both segment boundaries and their labels) look like valid segment times and labels, and throws helpful errors if not.

Parameters
reference_intervalsnp.ndarray, shape=(n, 2)

reference segment intervals, in the format returned by mir_eval.io.load_labeled_intervals().

reference_labelslist, shape=(n,)

reference segment labels, in the format returned by mir_eval.io.load_labeled_intervals().

estimated_intervalsnp.ndarray, shape=(m, 2)

estimated segment intervals, in the format returned by mir_eval.io.load_labeled_intervals().

estimated_labelslist, shape=(m,)

estimated segment labels, in the format returned by mir_eval.io.load_labeled_intervals().

mir_eval.segment.detection(reference_intervals, estimated_intervals, window=0.5, beta=1.0, trim=False)

Boundary detection hit-rate.

A hit is counted whenever a reference boundary is within window of an estimated boundary. Note that each boundary is matched at most once: this is achieved by computing the size of a maximal matching between reference and estimated boundary points, subject to the window constraint.

Parameters
reference_intervalsnp.ndarray, shape=(n, 2)

reference segment intervals, in the format returned by mir_eval.io.load_intervals() or mir_eval.io.load_labeled_intervals().

estimated_intervalsnp.ndarray, shape=(m, 2)

estimated segment intervals, in the format returned by mir_eval.io.load_intervals() or mir_eval.io.load_labeled_intervals().

windowfloat > 0

size of the window of ‘correctness’ around reference boundaries (in seconds) (Default value = 0.5)

betafloat > 0

weighting constant for F-measure. (Default value = 1.0)

trimboolean

if True, the first and last boundary times are ignored. Typically, these denote start (0) and end-markers. (Default value = False)

Returns
precisionfloat

precision of estimated predictions

recallfloat

recall of reference boundaries

f_measurefloat

F-measure (weighted harmonic mean of precision and recall)

Examples

>>> ref_intervals, _ = mir_eval.io.load_labeled_intervals('ref.lab')
>>> est_intervals, _ = mir_eval.io.load_labeled_intervals('est.lab')
>>> # With 0.5s windowing
>>> P05, R05, F05 = mir_eval.segment.detection(ref_intervals,
...                                            est_intervals,
...                                            window=0.5)
>>> # With 3s windowing
>>> P3, R3, F3 = mir_eval.segment.detection(ref_intervals,
...                                         est_intervals,
...                                         window=3)
>>> # Ignoring hits for the beginning and end of track
>>> P, R, F = mir_eval.segment.detection(ref_intervals,
...                                      est_intervals,
...                                      window=0.5,
...                                      trim=True)
mir_eval.segment.deviation(reference_intervals, estimated_intervals, trim=False)

Compute the median deviations between reference and estimated boundary times.

Parameters
reference_intervalsnp.ndarray, shape=(n, 2)

reference segment intervals, in the format returned by mir_eval.io.load_intervals() or mir_eval.io.load_labeled_intervals().

estimated_intervalsnp.ndarray, shape=(m, 2)

estimated segment intervals, in the format returned by mir_eval.io.load_intervals() or mir_eval.io.load_labeled_intervals().

trimboolean

if True, the first and last intervals are ignored. Typically, these denote start (0.0) and end-of-track markers. (Default value = False)

Returns
reference_to_estimatedfloat

median time from each reference boundary to the closest estimated boundary

estimated_to_referencefloat

median time from each estimated boundary to the closest reference boundary

Examples

>>> ref_intervals, _ = mir_eval.io.load_labeled_intervals('ref.lab')
>>> est_intervals, _ = mir_eval.io.load_labeled_intervals('est.lab')
>>> r_to_e, e_to_r = mir_eval.segment.deviation(
...     ref_intervals, est_intervals)
mir_eval.segment.pairwise(reference_intervals, reference_labels, estimated_intervals, estimated_labels, frame_size=0.1, beta=1.0)

Frame-clustering segmentation evaluation by pair-wise agreement.

Parameters
reference_intervalsnp.ndarray, shape=(n, 2)

reference segment intervals, in the format returned by mir_eval.io.load_labeled_intervals().

reference_labelslist, shape=(n,)

reference segment labels, in the format returned by mir_eval.io.load_labeled_intervals().

estimated_intervalsnp.ndarray, shape=(m, 2)

estimated segment intervals, in the format returned by mir_eval.io.load_labeled_intervals().

estimated_labelslist, shape=(m,)

estimated segment labels, in the format returned by mir_eval.io.load_labeled_intervals().

frame_sizefloat > 0

length (in seconds) of frames for clustering (Default value = 0.1)

betafloat > 0

beta value for F-measure (Default value = 1.0)

Returns
precisionfloat > 0

Precision of detecting whether frames belong in the same cluster

recallfloat > 0

Recall of detecting whether frames belong in the same cluster

ffloat > 0

F-measure of detecting whether frames belong in the same cluster

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> # Trim or pad the estimate to match reference timing
>>> (ref_intervals,
...  ref_labels) = mir_eval.util.adjust_intervals(ref_intervals,
...                                               ref_labels,
...                                               t_min=0)
>>> (est_intervals,
...  est_labels) = mir_eval.util.adjust_intervals(
...     est_intervals, est_labels, t_min=0, t_max=ref_intervals.max())
>>> precision, recall, f = mir_eval.segment.pairwise(
...     ref_intervals, ref_labels, est_intervals, est_labels)
mir_eval.segment.rand_index(reference_intervals, reference_labels, estimated_intervals, estimated_labels, frame_size=0.1, beta=1.0)

(Non-adjusted) Rand index.

Parameters
reference_intervalsnp.ndarray, shape=(n, 2)

reference segment intervals, in the format returned by mir_eval.io.load_labeled_intervals().

reference_labelslist, shape=(n,)

reference segment labels, in the format returned by mir_eval.io.load_labeled_intervals().

estimated_intervalsnp.ndarray, shape=(m, 2)

estimated segment intervals, in the format returned by mir_eval.io.load_labeled_intervals().

estimated_labelslist, shape=(m,)

estimated segment labels, in the format returned by mir_eval.io.load_labeled_intervals().

frame_sizefloat > 0

length (in seconds) of frames for clustering (Default value = 0.1)

betafloat > 0

beta value for F-measure (Default value = 1.0)

Returns
rand_indexfloat > 0

Rand index

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> # Trim or pad the estimate to match reference timing
>>> (ref_intervals,
...  ref_labels) = mir_eval.util.adjust_intervals(ref_intervals,
...                                               ref_labels,
...                                               t_min=0)
>>> (est_intervals,
...  est_labels) = mir_eval.util.adjust_intervals(
...     est_intervals, est_labels, t_min=0, t_max=ref_intervals.max())
>>> rand_index = mir_eval.segment.rand_index(
...     ref_intervals, ref_labels, est_intervals, est_labels)
mir_eval.segment.ari(reference_intervals, reference_labels, estimated_intervals, estimated_labels, frame_size=0.1)

Adjusted Rand Index (ARI) for frame clustering segmentation evaluation.

Parameters
reference_intervalsnp.ndarray, shape=(n, 2)

reference segment intervals, in the format returned by mir_eval.io.load_labeled_intervals().

reference_labelslist, shape=(n,)

reference segment labels, in the format returned by mir_eval.io.load_labeled_intervals().

estimated_intervalsnp.ndarray, shape=(m, 2)

estimated segment intervals, in the format returned by mir_eval.io.load_labeled_intervals().

estimated_labelslist, shape=(m,)

estimated segment labels, in the format returned by mir_eval.io.load_labeled_intervals().

frame_sizefloat > 0

length (in seconds) of frames for clustering (Default value = 0.1)

Returns
ari_scorefloat > 0

Adjusted Rand index between segmentations.

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> # Trim or pad the estimate to match reference timing
>>> (ref_intervals,
...  ref_labels) = mir_eval.util.adjust_intervals(ref_intervals,
...                                               ref_labels,
...                                               t_min=0)
>>> (est_intervals,
...  est_labels) = mir_eval.util.adjust_intervals(
...     est_intervals, est_labels, t_min=0, t_max=ref_intervals.max())
>>> ari_score = mir_eval.segment.ari(
...     ref_intervals, ref_labels, est_intervals, est_labels)
mir_eval.segment.mutual_information(reference_intervals, reference_labels, estimated_intervals, estimated_labels, frame_size=0.1)

Frame-clustering segmentation: mutual information metrics.

Parameters
reference_intervalsnp.ndarray, shape=(n, 2)

reference segment intervals, in the format returned by mir_eval.io.load_labeled_intervals().

reference_labelslist, shape=(n,)

reference segment labels, in the format returned by mir_eval.io.load_labeled_intervals().

estimated_intervalsnp.ndarray, shape=(m, 2)

estimated segment intervals, in the format returned by mir_eval.io.load_labeled_intervals().

estimated_labelslist, shape=(m,)

estimated segment labels, in the format returned by mir_eval.io.load_labeled_intervals().

frame_sizefloat > 0

length (in seconds) of frames for clustering (Default value = 0.1)

Returns
MIfloat > 0

Mutual information between segmentations

AMIfloat

Adjusted mutual information between segmentations.

NMIfloat > 0

Normalized mutual information between segmentations

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> # Trim or pad the estimate to match reference timing
>>> (ref_intervals,
...  ref_labels) = mir_eval.util.adjust_intervals(ref_intervals,
...                                               ref_labels,
...                                               t_min=0)
>>> (est_intervals,
...  est_labels) = mir_eval.util.adjust_intervals(
...     est_intervals, est_labels, t_min=0, t_max=ref_intervals.max())
>>> mi, ami, nmi = mir_eval.segment.mutual_information(
...     ref_intervals, ref_labels, est_intervals, est_labels)
mir_eval.segment.nce(reference_intervals, reference_labels, estimated_intervals, estimated_labels, frame_size=0.1, beta=1.0, marginal=False)

Frame-clustering segmentation: normalized conditional entropy

Computes cross-entropy of cluster assignment, normalized by the max-entropy.

Parameters
reference_intervalsnp.ndarray, shape=(n, 2)

reference segment intervals, in the format returned by mir_eval.io.load_labeled_intervals().

reference_labelslist, shape=(n,)

reference segment labels, in the format returned by mir_eval.io.load_labeled_intervals().

estimated_intervalsnp.ndarray, shape=(m, 2)

estimated segment intervals, in the format returned by mir_eval.io.load_labeled_intervals().

estimated_labelslist, shape=(m,)

estimated segment labels, in the format returned by mir_eval.io.load_labeled_intervals().

frame_sizefloat > 0

length (in seconds) of frames for clustering (Default value = 0.1)

betafloat > 0

beta for F-measure (Default value = 1.0)

marginalbool

If False, normalize conditional entropy by uniform entropy. If True, normalize conditional entropy by the marginal entropy. (Default value = False)

Returns
S_over

Over-clustering score:

  • For marginal=False, 1 - H(y_est | y_ref) / log(|y_est|)

  • For marginal=True, 1 - H(y_est | y_ref) / H(y_est)

If |y_est|==1, then S_over will be 0.

S_under

Under-clustering score:

  • For marginal=False, 1 - H(y_ref | y_est) / log(|y_ref|)

  • For marginal=True, 1 - H(y_ref | y_est) / H(y_ref)

If |y_ref|==1, then S_under will be 0.

S_F

F-measure for (S_over, S_under)

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> # Trim or pad the estimate to match reference timing
>>> (ref_intervals,
...  ref_labels) = mir_eval.util.adjust_intervals(ref_intervals,
...                                               ref_labels,
...                                               t_min=0)
>>> (est_intervals,
...  est_labels) = mir_eval.util.adjust_intervals(
...     est_intervals, est_labels, t_min=0, t_max=ref_intervals.max())
>>> S_over, S_under, S_F = mir_eval.segment.nce(
...     ref_intervals, ref_labels, est_intervals, est_labels)
mir_eval.segment.vmeasure(reference_intervals, reference_labels, estimated_intervals, estimated_labels, frame_size=0.1, beta=1.0)

Frame-clustering segmentation: v-measure

Computes cross-entropy of cluster assignment, normalized by the marginal-entropy.

This is equivalent to nce(…, marginal=True).

Parameters
reference_intervalsnp.ndarray, shape=(n, 2)

reference segment intervals, in the format returned by mir_eval.io.load_labeled_intervals().

reference_labelslist, shape=(n,)

reference segment labels, in the format returned by mir_eval.io.load_labeled_intervals().

estimated_intervalsnp.ndarray, shape=(m, 2)

estimated segment intervals, in the format returned by mir_eval.io.load_labeled_intervals().

estimated_labelslist, shape=(m,)

estimated segment labels, in the format returned by mir_eval.io.load_labeled_intervals().

frame_sizefloat > 0

length (in seconds) of frames for clustering (Default value = 0.1)

betafloat > 0

beta for F-measure (Default value = 1.0)

Returns
V_precision

Over-clustering score: 1 - H(y_est | y_ref) / H(y_est)

If |y_est|==1, then V_precision will be 0.

V_recall

Under-clustering score: 1 - H(y_ref | y_est) / H(y_ref)

If |y_ref|==1, then V_recall will be 0.

V_F

F-measure for (V_precision, V_recall)

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> # Trim or pad the estimate to match reference timing
>>> (ref_intervals,
...  ref_labels) = mir_eval.util.adjust_intervals(ref_intervals,
...                                               ref_labels,
...                                               t_min=0)
>>> (est_intervals,
...  est_labels) = mir_eval.util.adjust_intervals(
...     est_intervals, est_labels, t_min=0, t_max=ref_intervals.max())
>>> V_precision, V_recall, V_F = mir_eval.segment.vmeasure(
...     ref_intervals, ref_labels, est_intervals, est_labels)
mir_eval.segment.evaluate(ref_intervals, ref_labels, est_intervals, est_labels, **kwargs)

Compute all metrics for the given reference and estimated annotations.

Parameters
ref_intervalsnp.ndarray, shape=(n, 2)

reference segment intervals, in the format returned by mir_eval.io.load_labeled_intervals().

ref_labelslist, shape=(n,)

reference segment labels, in the format returned by mir_eval.io.load_labeled_intervals().

est_intervalsnp.ndarray, shape=(m, 2)

estimated segment intervals, in the format returned by mir_eval.io.load_labeled_intervals().

est_labelslist, shape=(m,)

estimated segment labels, in the format returned by mir_eval.io.load_labeled_intervals().

kwargs

Additional keyword arguments which will be passed to the appropriate metric or preprocessing functions.

Returns
scoresdict

Dictionary of scores, where the key is the metric name (str) and the value is the (float) score achieved.

Examples

>>> (ref_intervals,
...  ref_labels) = mir_eval.io.load_labeled_intervals('ref.lab')
>>> (est_intervals,
...  est_labels) = mir_eval.io.load_labeled_intervals('est.lab')
>>> scores = mir_eval.segment.evaluate(ref_intervals, ref_labels,
...                                    est_intervals, est_labels)

mir_eval.hierarchy

Evaluation criteria for hierarchical structure analysis.

Hierarchical structure analysis seeks to annotate a track with a nested decomposition of the temporal elements of the piece, effectively providing a kind of “parse tree” of the composition. Unlike the flat segmentation metrics defined in mir_eval.segment, which can only encode one level of analysis, hierarchical annotations expose the relationships between short segments and the larger compositional elements to which they belong.

Conventions

Annotations are assumed to take the form of an ordered list of segmentations. As in the mir_eval.segment metrics, each segmentation itself consists of an n-by-2 array of interval times, so that the i th segment spans time intervals[i, 0] to intervals[i, 1].

Hierarchical annotations are ordered by increasing specificity, so that the first segmentation should contain the fewest segments, and the last segmentation contains the most.

Metrics

  • mir_eval.hierarchy.tmeasure(): T-measure precision, recall, and F-measure for hierarchical boundary (interval) annotations 15

  • mir_eval.hierarchy.lmeasure(): L-measure precision, recall, and F-measure for hierarchical labeled annotations 16

References

15

Brian McFee, Oriol Nieto, and Juan P. Bello. “Hierarchical evaluation of segment boundary detection”, International Society for Music Information Retrieval (ISMIR) conference, 2015.

16

Brian McFee, Oriol Nieto, Morwaread Farbood, and Juan P. Bello. “Evaluating hierarchical structure in music annotations”, Frontiers in Psychology, 2017.

mir_eval.hierarchy.validate_hier_intervals(intervals_hier)

Validate a hierarchical segment annotation.

Parameters
intervals_hierordered list of segmentations
Raises
ValueError

If any segmentation does not span the full duration of the top-level segmentation.

If any segmentation does not start at 0.

mir_eval.hierarchy.tmeasure(reference_intervals_hier, estimated_intervals_hier, transitive=False, window=15.0, frame_size=0.1, beta=1.0)

Computes the tree measures for hierarchical segment annotations.

Parameters
reference_intervals_hierlist of ndarray

reference_intervals_hier[i] contains the segment intervals (in seconds) for the i th layer of the annotations. Layers are ordered from top to bottom, so that the last list of intervals should be the most specific.

estimated_intervals_hierlist of ndarray

Like reference_intervals_hier but for the estimated annotation

transitivebool

whether to compute the t-measures using transitivity or not.

windowfloat > 0

size of the window (in seconds). For each query frame q, result frames are only counted within q +- window.

frame_sizefloat > 0

length (in seconds) of frames. The frame size cannot be longer than the window.

betafloat > 0

beta parameter for the F-measure.

Returns
t_precisionnumber [0, 1]

T-measure Precision

t_recallnumber [0, 1]

T-measure Recall

t_measurenumber [0, 1]

F-beta measure for (t_precision, t_recall)

Raises
ValueError

If either of the input hierarchies are inconsistent

If the input hierarchies have different time durations

If frame_size > window or frame_size <= 0
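
Examples

A toy sketch using the same two-layer hierarchies as the mir_eval.hierarchy.evaluate() example below:

>>> import numpy as np
>>> ref_i = [np.array([[0, 30], [30, 60]]),
...          np.array([[0, 15], [15, 30], [30, 45], [45, 60]])]
>>> est_i = [np.array([[0, 45], [45, 60]]),
...          np.array([[0, 15], [15, 30], [30, 45], [45, 60]])]
>>> t_precision, t_recall, t_measure = mir_eval.hierarchy.tmeasure(ref_i, est_i)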

mir_eval.hierarchy.lmeasure(reference_intervals_hier, reference_labels_hier, estimated_intervals_hier, estimated_labels_hier, frame_size=0.1, beta=1.0)

Computes the L-measures (label-based precision, recall, and F-measure) for hierarchical segment annotations.

Parameters
reference_intervals_hierlist of ndarray

reference_intervals_hier[i] contains the segment intervals (in seconds) for the i th layer of the annotations. Layers are ordered from top to bottom, so that the last list of intervals should be the most specific.

reference_labels_hierlist of list of str

reference_labels_hier[i] contains the segment labels for the i th layer of the annotations

estimated_intervals_hierlist of ndarray
estimated_labels_hierlist of ndarray

Like reference_intervals_hier and reference_labels_hier but for the estimated annotation

frame_sizefloat > 0

length (in seconds) of frames.

betafloat > 0

beta parameter for the F-measure.

Returns
l_precisionnumber [0, 1]

L-measure Precision

l_recallnumber [0, 1]

L-measure Recall

l_measurenumber [0, 1]

F-beta measure for (l_precision, l_recall)

Raises
ValueError

If either of the input hierarchies are inconsistent

If the input hierarchies have different time durations

If frame_size <= 0
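
Examples

A toy sketch using the same two-layer labeled hierarchies as the mir_eval.hierarchy.evaluate() example below:

>>> import numpy as np
>>> ref_i = [np.array([[0, 30], [30, 60]]),
...          np.array([[0, 15], [15, 30], [30, 45], [45, 60]])]
>>> est_i = [np.array([[0, 45], [45, 60]]),
...          np.array([[0, 15], [15, 30], [30, 45], [45, 60]])]
>>> ref_l = [['A', 'B'], ['a', 'b', 'a', 'c']]
>>> est_l = [['A', 'B'], ['a', 'a', 'b', 'b']]
>>> l_precision, l_recall, l_measure = mir_eval.hierarchy.lmeasure(
...     ref_i, ref_l, est_i, est_l)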

mir_eval.hierarchy.evaluate(ref_intervals_hier, ref_labels_hier, est_intervals_hier, est_labels_hier, **kwargs)

Compute all hierarchical structure metrics for the given reference and estimated annotations.

Parameters
ref_intervals_hierlist of list-like
ref_labels_hierlist of list of str
est_intervals_hierlist of list-like
est_labels_hierlist of list of str

Hierarchical annotations are encoded as an ordered list of segmentations. Each segmentation itself is a list (or list-like) of intervals (*_intervals_hier) and a list of lists of labels (*_labels_hier).

kwargs

additional keyword arguments to the evaluation metrics.

Returns
scoresOrderedDict

Dictionary of scores, where the key is the metric name (str) and the value is the (float) score achieved.

T-measures are computed in both the “full” (transitive=True) and “reduced” (transitive=False) modes.

Raises
ValueError

Thrown when the provided annotations are not valid.

Examples

A toy example with two two-layer annotations

>>> ref_i = [[[0, 30], [30, 60]], [[0, 15], [15, 30], [30, 45], [45, 60]]]
>>> est_i = [[[0, 45], [45, 60]], [[0, 15], [15, 30], [30, 45], [45, 60]]]
>>> ref_l = [ ['A', 'B'], ['a', 'b', 'a', 'c'] ]
>>> est_l = [ ['A', 'B'], ['a', 'a', 'b', 'b'] ]
>>> scores = mir_eval.hierarchy.evaluate(ref_i, ref_l, est_i, est_l)
>>> dict(scores)
{'T-Measure full': 0.94822745804853459,
 'T-Measure reduced': 0.8732458222764804,
 'T-Precision full': 0.96569179094693058,
 'T-Precision reduced': 0.89939075137018787,
 'T-Recall full': 0.93138358189386117,
 'T-Recall reduced': 0.84857799953694923}

A more realistic example, using SALAMI pre-parsed annotations

>>> def load_salami(filename):
...     "load SALAMI event format as labeled intervals"
...     events, labels = mir_eval.io.load_labeled_events(filename)
...     intervals = mir_eval.util.boundaries_to_intervals(events)[0]
...     return intervals, labels[:len(intervals)]
>>> ref_files = ['data/10/parsed/textfile1_uppercase.txt',
...              'data/10/parsed/textfile1_lowercase.txt']
>>> est_files = ['data/10/parsed/textfile2_uppercase.txt',
...              'data/10/parsed/textfile2_lowercase.txt']
>>> ref = [load_salami(fname) for fname in ref_files]
>>> ref_int = [seg[0] for seg in ref]
>>> ref_lab = [seg[1] for seg in ref]
>>> est = [load_salami(fname) for fname in est_files]
>>> est_int = [seg[0] for seg in est]
>>> est_lab = [seg[1] for seg in est]
>>> scores = mir_eval.hierarchy.evaluate(ref_int, ref_lab,
...                                      est_int, est_lab)
>>> dict(scores)
{'T-Measure full': 0.66029225561405358,
 'T-Measure reduced': 0.62001868041578034,
 'T-Precision full': 0.66844764668949885,
 'T-Precision reduced': 0.63252297209957919,
 'T-Recall full': 0.6523334654992341,
 'T-Recall reduced': 0.60799919710921635}

mir_eval.separation

Source separation algorithms attempt to extract recordings of individual sources from a recording of a mixture of sources. Evaluation methods for source separation compare the extracted sources against the reference sources and attempt to measure the perceptual quality of the separation.

See also the bss_eval MATLAB toolbox:

http://bass-db.gforge.inria.fr/bss_eval/

Conventions

An audio signal is expected to be in the format of a 1-dimensional array where the entries are the samples of the audio signal. When providing a group of estimated or reference sources, they should be provided in a 2-dimensional array, where the first dimension corresponds to the source number and the second corresponds to the samples.
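
For example, two hypothetical mono sources of one second each at 44.1 kHz would be stacked into arrays of shape (2, 44100); the random signals below merely stand in for real audio:

>>> import numpy as np
>>> # axis 0 indexes the source, axis 1 indexes the samples
>>> reference_sources = np.random.randn(2, 44100)
>>> estimated_sources = np.random.randn(2, 44100)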

Metrics

References

17

Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte, “Performance measurement in blind audio source separation,” IEEE Trans. on Audio, Speech and Language Processing, 14(4):1462-1469, 2006.

mir_eval.separation.validate(reference_sources, estimated_sources)

Checks that the input data to a metric are valid, and throws helpful errors if not.

Parameters
reference_sourcesnp.ndarray, shape=(nsrc, nsampl)

matrix containing true sources

estimated_sourcesnp.ndarray, shape=(nsrc, nsampl)

matrix containing estimated sources

mir_eval.separation.bss_eval_sources(reference_sources, estimated_sources, compute_permutation=True)

Ordering and measurement of the separation quality for estimated source signals in terms of filtered true source, interference and artifacts.

The decomposition allows a time-invariant filter distortion of length 512, as described in Section III.B of 17.

Passing False for compute_permutation will improve the computational performance of the evaluation; however, it is not always appropriate and is not the way that the BSS_EVAL Matlab toolbox computes bss_eval_sources.

Parameters
reference_sourcesnp.ndarray, shape=(nsrc, nsampl)

matrix containing true sources (must have same shape as estimated_sources)

estimated_sourcesnp.ndarray, shape=(nsrc, nsampl)

matrix containing estimated sources (must have same shape as reference_sources)

compute_permutationbool, optional

compute permutation of estimate/source combinations (True by default)

Returns
sdrnp.ndarray, shape=(nsrc,)

vector of Signal to Distortion Ratios (SDR)

sirnp.ndarray, shape=(nsrc,)

vector of Source to Interference Ratios (SIR)

sarnp.ndarray, shape=(nsrc,)

vector of Sources to Artifacts Ratios (SAR)

permnp.ndarray, shape=(nsrc,)

vector containing the best ordering of estimated sources in the mean SIR sense (estimated source number perm[j] corresponds to true source number j). Note: perm will be [0, 1, ..., nsrc-1] if compute_permutation is False.

References

18

Emmanuel Vincent, Shoko Araki, Fabian J. Theis, Guido Nolte, Pau Bofill, Hiroshi Sawada, Alexey Ozerov, B. Vikrham Gowreesunker, Dominik Lutter and Ngoc Q.K. Duong, “The Signal Separation Evaluation Campaign (2007-2010): Achievements and remaining challenges”, Signal Processing, 92, pp. 1928-1936, 2012.

Examples

>>> # reference_sources[n] should be an ndarray of samples of the
>>> # n'th reference source
>>> # estimated_sources[n] should be the same for the n'th estimated
>>> # source
>>> (sdr, sir, sar,
...  perm) = mir_eval.separation.bss_eval_sources(reference_sources,
...                                               estimated_sources)
mir_eval.separation.bss_eval_sources_framewise(reference_sources, estimated_sources, window=1323000, hop=661500, compute_permutation=False)

Framewise computation of bss_eval_sources

Please be aware that this function does not compute permutations (by default) on the possible relations between reference_sources and estimated_sources due to the dangers of a changing permutation. Therefore (by default), it assumes that reference_sources[i] corresponds to estimated_sources[i]. To enable computing permutations please set compute_permutation to be True and check that the returned perm is identical for all windows.

NOTE: if reference_sources and estimated_sources would be evaluated using only a single window or are shorter than the window length, the result of mir_eval.separation.bss_eval_sources() called on reference_sources and estimated_sources (with the compute_permutation parameter passed to mir_eval.separation.bss_eval_sources()) is returned.

Parameters
reference_sourcesnp.ndarray, shape=(nsrc, nsampl)

matrix containing true sources (must have the same shape as estimated_sources)

estimated_sourcesnp.ndarray, shape=(nsrc, nsampl)

matrix containing estimated sources (must have the same shape as reference_sources)

windowint, optional

Window length for framewise evaluation (default value is 30s at a sample rate of 44.1kHz)

hopint, optional

Hop size for framewise evaluation (default value is 15s at a sample rate of 44.1kHz)

compute_permutationbool, optional

compute permutation of estimate/source combinations for all windows (False by default)

Returns
sdrnp.ndarray, shape=(nsrc, nframes)

vector of Signal to Distortion Ratios (SDR)

sirnp.ndarray, shape=(nsrc, nframes)

vector of Source to Interference Ratios (SIR)

sarnp.ndarray, shape=(nsrc, nframes)

vector of Sources to Artifacts Ratios (SAR)

permnp.ndarray, shape=(nsrc, nframes)

vector containing the best ordering of estimated sources in the mean SIR sense (estimated source number perm[j] corresponds to true source number j). Note: perm will be range(nsrc) for all windows if compute_permutation is False

Examples

>>> # reference_sources[n] should be an ndarray of samples of the
>>> # n'th reference source
>>> # estimated_sources[n] should be the same for the n'th estimated
>>> # source
>>> (sdr, sir, sar,
...  perm) = mir_eval.separation.bss_eval_sources_framewise(
...      reference_sources,
...      estimated_sources)
mir_eval.separation.bss_eval_images(reference_sources, estimated_sources, compute_permutation=True)

Implementation of the bss_eval_images function from the BSS_EVAL Matlab toolbox.

Ordering and measurement of the separation quality for estimated source signals in terms of filtered true source, interference and artifacts. This method also provides the ISR measure.

The decomposition allows a time-invariant filter distortion of length 512, as described in Section III.B of 17.

Passing False for compute_permutation will improve the computational performance of the evaluation; however, it is not always appropriate and is not the way that the BSS_EVAL Matlab toolbox computes bss_eval_images.

Parameters
reference_sourcesnp.ndarray, shape=(nsrc, nsampl, nchan)

matrix containing true sources

estimated_sourcesnp.ndarray, shape=(nsrc, nsampl, nchan)

matrix containing estimated sources

compute_permutationbool, optional

compute permutation of estimate/source combinations (True by default)

Returns
sdrnp.ndarray, shape=(nsrc,)

vector of Signal to Distortion Ratios (SDR)

isrnp.ndarray, shape=(nsrc,)

vector of source Image to Spatial distortion Ratios (ISR)

sirnp.ndarray, shape=(nsrc,)

vector of Source to Interference Ratios (SIR)

sarnp.ndarray, shape=(nsrc,)

vector of Sources to Artifacts Ratios (SAR)

permnp.ndarray, shape=(nsrc,)

vector containing the best ordering of estimated sources in the mean SIR sense (estimated source number perm[j] corresponds to true source number j). Note: perm will be [0, 1, ..., nsrc-1] if compute_permutation is False.

References

19

Emmanuel Vincent, Shoko Araki, Fabian J. Theis, Guido Nolte, Pau Bofill, Hiroshi Sawada, Alexey Ozerov, B. Vikrham Gowreesunker, Dominik Lutter and Ngoc Q.K. Duong, “The Signal Separation Evaluation Campaign (2007-2010): Achievements and remaining challenges”, Signal Processing, 92, pp. 1928-1936, 2012.

Examples

>>> # reference_sources[n] should be an ndarray of samples of the
>>> # n'th reference source
>>> # estimated_sources[n] should be the same for the n'th estimated
>>> # source
>>> (sdr, isr, sir, sar,
...  perm) = mir_eval.separation.bss_eval_images(reference_sources,
...                                               estimated_sources)
mir_eval.separation.bss_eval_images_framewise(reference_sources, estimated_sources, window=1323000, hop=661500, compute_permutation=False)

Framewise computation of bss_eval_images

Please be aware that this function does not compute permutations (by default) on the possible relations between reference_sources and estimated_sources due to the dangers of a changing permutation. Therefore (by default), it assumes that reference_sources[i] corresponds to estimated_sources[i]. To enable computing permutations please set compute_permutation to be True and check that the returned perm is identical for all windows.

NOTE: if reference_sources and estimated_sources would be evaluated using only a single window or are shorter than the window length, the result of bss_eval_images called on reference_sources and estimated_sources (with the compute_permutation parameter passed to bss_eval_images) is returned

Parameters
reference_sourcesnp.ndarray, shape=(nsrc, nsampl, nchan)

matrix containing true sources (must have the same shape as estimated_sources)

estimated_sourcesnp.ndarray, shape=(nsrc, nsampl, nchan)

matrix containing estimated sources (must have the same shape as reference_sources)

windowint

Window length for framewise evaluation

hopint

Hop size for framewise evaluation

compute_permutationbool, optional

compute permutation of estimate/source combinations for all windows (False by default)

Returns
sdrnp.ndarray, shape=(nsrc, nframes)

vector of Signal to Distortion Ratios (SDR)

isrnp.ndarray, shape=(nsrc, nframes)

vector of source Image to Spatial distortion Ratios (ISR)

sirnp.ndarray, shape=(nsrc, nframes)

vector of Source to Interference Ratios (SIR)

sarnp.ndarray, shape=(nsrc, nframes)

vector of Sources to Artifacts Ratios (SAR)

permnp.ndarray, shape=(nsrc, nframes)

vector containing the best ordering of estimated sources in the mean SIR sense (estimated source number perm[j] corresponds to true source number j). Note: perm will be range(nsrc) for all windows if compute_permutation is False

Examples

>>> # reference_sources[n] should be an ndarray of samples of the
>>> # n'th reference source
>>> # estimated_sources[n] should be the same for the n'th estimated
>>> # source
>>> (sdr, isr, sir, sar,
...  perm) = mir_eval.separation.bss_eval_images_framewise(
...      reference_sources,
...      estimated_sources,
...      window,
...      hop)
mir_eval.separation.evaluate(reference_sources, estimated_sources, **kwargs)

Compute all metrics for the given reference and estimated signals.

NOTE: This will always compute mir_eval.separation.bss_eval_images() for any valid input and will additionally compute mir_eval.separation.bss_eval_sources() for valid input with fewer than 3 dimensions.

Parameters
reference_sourcesnp.ndarray, shape=(nsrc, nsampl[, nchan])

matrix containing true sources

estimated_sourcesnp.ndarray, shape=(nsrc, nsampl[, nchan])

matrix containing estimated sources

kwargs

Additional keyword arguments which will be passed to the appropriate metric or preprocessing functions.

Returns
scoresdict

Dictionary of scores, where the key is the metric name (str) and the value is the (float) score achieved.

Examples

>>> # reference_sources[n] should be an ndarray of samples of the
>>> # n'th reference source
>>> # estimated_sources[n] should be the same for the n'th estimated source
>>> scores = mir_eval.separation.evaluate(reference_sources,
...                                       estimated_sources)

mir_eval.tempo

The goal of a tempo estimation algorithm is to automatically detect the tempo of a piece of music, measured in beats per minute (BPM).

See http://www.music-ir.org/mirex/wiki/2014:Audio_Tempo_Estimation for a description of the task and evaluation criteria.

Conventions

Reference and estimated tempi should be positive, and provided in ascending order as a numpy array of length 2.

The weighting value from the reference must be a float in the range [0, 1].

Metrics

mir_eval.tempo.validate_tempi(tempi, reference=True)

Checks that there are two non-negative tempi. For a reference value, at least one tempo has to be greater than zero.

Parameters
tempinp.ndarray

length-2 array of tempo, in bpm

referencebool

indicates a reference value

mir_eval.tempo.validate(reference_tempi, reference_weight, estimated_tempi)

Checks that the input annotations to a metric look like valid tempo annotations.

Parameters
reference_tempinp.ndarray

reference tempo values, in bpm

reference_weightfloat

perceptual weight of slow vs fast in reference

estimated_tempinp.ndarray

estimated tempo values, in bpm

mir_eval.tempo.detection(reference_tempi, reference_weight, estimated_tempi, tol=0.08)

Compute the tempo detection accuracy metric.

Parameters
reference_tempinp.ndarray, shape=(2,)

Two non-negative reference tempi

reference_weightfloat > 0

The relative strength of reference_tempi[0] vs reference_tempi[1].

estimated_tempinp.ndarray, shape=(2,)

Two non-negative estimated tempi.

tolfloat in [0, 1]

The maximum allowable deviation from a reference tempo to count as a hit. |est_t - ref_t| <= tol * ref_t (Default value = 0.08)

Returns
p_scorefloat in [0, 1]

Weighted average of recalls: reference_weight * hits[0] + (1 - reference_weight) * hits[1]

one_correctbool

True if at least one reference tempo was correctly estimated

both_correctbool

True if both reference tempi were correctly estimated

Raises
ValueError

If the input tempi are ill-formed

If the reference weight is not in the range [0, 1]

If tol < 0 or tol > 1.
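
A minimal sketch with made-up tempi (here both estimates fall within 8% of their references, so both_correct should be True):

>>> import numpy as np
>>> reference_tempi = np.array([60.0, 120.0])
>>> estimated_tempi = np.array([60.5, 122.0])
>>> p_score, one_correct, both_correct = mir_eval.tempo.detection(
...     reference_tempi, 0.5, estimated_tempi)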

mir_eval.tempo.evaluate(reference_tempi, reference_weight, estimated_tempi, **kwargs)

Compute all metrics for the given reference and estimated annotations.

Parameters
reference_tempinp.ndarray, shape=(2,)

Two non-negative reference tempi

reference_weightfloat > 0

The relative strength of reference_tempi[0] vs reference_tempi[1].

estimated_tempinp.ndarray, shape=(2,)

Two non-negative estimated tempi.

kwargs

Additional keyword arguments which will be passed to the appropriate metric or preprocessing functions.

Returns
scoresdict

Dictionary of scores, where the key is the metric name (str) and the value is the (float) score achieved.
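
For example, reusing the arrays from the detection sketch above with an illustrative reference weight of 0.5:

>>> scores = mir_eval.tempo.evaluate(reference_tempi, 0.5, estimated_tempi)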

mir_eval.transcription

The aim of a transcription algorithm is to produce a symbolic representation of a recorded piece of music in the form of a set of discrete notes. There are different ways to represent notes symbolically. Here we use the piano-roll convention, meaning each note has a start time, a duration (or end time), and a single, constant pitch value. Pitch values can be quantized (e.g. to a semitone grid tuned to 440 Hz), but do not have to be. Also, the transcription can contain the notes of a single instrument or voice (for example the melody), or the notes of all instruments/voices in the recording. This module is instrument-agnostic: all notes in the estimate are compared against all notes in the reference.

There are many metrics for evaluating transcription algorithms. Here we limit ourselves to the simplest and most commonly used: given two sets of notes, we count how many estimated notes match the reference, and how many do not. Based on these counts we compute the precision, recall, f-measure and overlap ratio of the estimate given the reference. The default criteria for considering two notes to be a match are adopted from the MIREX Multiple fundamental frequency estimation and tracking, Note Tracking subtask (task 2):

“This subtask is evaluated in two different ways. In the first setup, a returned note is assumed correct if its onset is within +-50ms of a reference note and its F0 is within +- quarter tone of the corresponding reference note, ignoring the returned offset values. In the second setup, on top of the above requirements, a correct returned note is required to have an offset value within 20% of the reference note’s duration around the reference note’s offset, or within 50ms, whichever is larger.”

In short, we compute precision, recall, f-measure and overlap ratio, once without taking offsets into account, and once with.

For further details see Salamon, 2013 (page 186), and references therein:

Salamon, J. (2013). Melody Extraction from Polyphonic Music Signals. Ph.D. thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2013.

IMPORTANT NOTE: the evaluation code in mir_eval contains several important differences with respect to the code used in MIREX 2015 for the Note Tracking subtask on the Su dataset (henceforth “MIREX”):

  1. mir_eval uses bipartite graph matching to find the optimal pairing of reference notes to estimated notes. MIREX uses a greedy matching algorithm, which can produce sub-optimal note matching. This will result in mir_eval’s metrics being slightly higher compared to MIREX.

  2. MIREX rounds down the onset and offset times of each note to 2 decimal places using new_time = 0.01 * floor(time*100). mir_eval rounds down the note onset and offset times to 4 decimal places. This will bring our metrics down a notch compared to the MIREX results.

  3. In the MIREX wiki, the criterion for matching offsets is that they must be within 0.2 * ref_duration or 0.05 seconds from each other, whichever is greater (i.e. offset_diff <= max(0.2 * ref_duration, 0.05)). The MIREX code however only uses a threshold of 0.2 * ref_duration, without the 0.05 second minimum. Since mir_eval does include this minimum, it might produce slightly higher results compared to MIREX.

This means that differences 1 and 3 bring mir_eval’s metrics up compared to MIREX, whilst 2 brings them down. Based on internal testing, overall the effect of these three differences is that the Precision, Recall and F-measure returned by mir_eval will be higher compared to MIREX by about 1%-2%.

Finally, note that different evaluation scripts have been used for the Multi-F0 Note Tracking task in MIREX over the years. In particular, some scripts used < for matching onsets, offsets, and pitch values, whilst the others used <= for these checks. mir_eval provides both options: by default the latter (<=) is used, but you can set strict=True when calling mir_eval.transcription.precision_recall_f1_overlap() in which case < will be used. The default value (strict=False) is the same as that used in MIREX 2015 for the Note Tracking subtask on the Su dataset.

Conventions

Notes should be provided in the form of an interval array and a pitch array. The interval array contains two columns, one for note onsets and the second for note offsets (each row represents a single note). The pitch array contains one column with the corresponding note pitch values (one value per note), represented by their fundamental frequency (f0) in Hertz.
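
For instance, two toy notes (an A4 from 0.1 s to 0.5 s and a middle C from 0.6 s to 1.2 s) would be encoded as:

>>> import numpy as np
>>> ref_intervals = np.array([[0.1, 0.5],
...                           [0.6, 1.2]])
>>> ref_pitches = np.array([440.0, 261.63])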

Metrics

  • mir_eval.transcription.precision_recall_f1_overlap(): The precision, recall, F-measure, and Average Overlap Ratio of the note transcription, where an estimated note is considered correct if its pitch, onset and (optionally) offset are sufficiently close to a reference note.

  • mir_eval.transcription.onset_precision_recall_f1(): The precision, recall and F-measure of the note transcription, where an estimated note is considered correct if its onset is sufficiently close to a reference note’s onset. That is, these metrics are computed taking only note onsets into account, meaning two notes could be matched even if they have very different pitch values.

  • mir_eval.transcription.offset_precision_recall_f1(): The precision, recall and F-measure of the note transcription, where an estimated note is considered correct if its offset is sufficiently close to a reference note’s offset. That is, these metrics are computed taking only note offsets into account, meaning two notes could be matched even if they have very different pitch values.

mir_eval.transcription.validate(ref_intervals, ref_pitches, est_intervals, est_pitches)

Checks that the input annotations to a metric look like time intervals and a pitch list, and throws helpful errors if not.

Parameters
ref_intervalsnp.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

ref_pitchesnp.ndarray, shape=(n,)

Array of reference pitch values in Hertz

est_intervalsnp.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

est_pitchesnp.ndarray, shape=(m,)

Array of estimated pitch values in Hertz

mir_eval.transcription.validate_intervals(ref_intervals, est_intervals)

Checks that the input annotations to a metric look like time intervals, and throws helpful errors if not.

Parameters
ref_intervalsnp.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

est_intervalsnp.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

mir_eval.transcription.match_note_offsets(ref_intervals, est_intervals, offset_ratio=0.2, offset_min_tolerance=0.05, strict=False)

Compute a maximum matching between reference and estimated notes, only taking note offsets into account.

Given two note sequences represented by ref_intervals and est_intervals (see mir_eval.io.load_valued_intervals()), we seek the largest set of correspondences (i, j) such that the offset of reference note i has to be within offset_tolerance of the offset of estimated note j, where offset_tolerance is equal to offset_ratio times the reference note’s duration, i.e. offset_ratio * ref_duration[i] where ref_duration[i] = ref_intervals[i, 1] - ref_intervals[i, 0]. If the resulting offset_tolerance is less than offset_min_tolerance (50 ms by default) then offset_min_tolerance is used instead.

Every reference note is matched against at most one estimated note.

Note there are separate functions match_note_onsets() and match_notes() for matching notes based on onsets only or based on onset, offset, and pitch, respectively. This is because the rules for matching note onsets and matching note offsets are different.

Parameters
ref_intervalsnp.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

est_intervalsnp.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

offset_ratiofloat > 0

The ratio of the reference note’s duration used to define the offset_tolerance. Default is 0.2 (20%), meaning the offset_tolerance will equal the ref_duration * 0.2, or 0.05 (50 ms), whichever is greater.

offset_min_tolerancefloat > 0

The minimum tolerance for offset matching. See offset_ratio description for an explanation of how the offset tolerance is determined.

strictbool

If strict=False (the default), threshold checks for offset matching are performed using <= (less than or equal). If strict=True, the threshold checks are performed using < (less than).

Returns
matchinglist of tuples

A list of matched reference and estimated notes. matching[i] == (i, j) where reference note i matches estimated note j.
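
As a usage sketch (the file names are placeholders, as in the other examples in this module), offset-only matching with the default tolerances looks like:

>>> ref_intervals, _ = mir_eval.io.load_valued_intervals('reference.txt')
>>> est_intervals, _ = mir_eval.io.load_valued_intervals('estimated.txt')
>>> matching = mir_eval.transcription.match_note_offsets(ref_intervals,
...                                                      est_intervals)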

mir_eval.transcription.match_note_onsets(ref_intervals, est_intervals, onset_tolerance=0.05, strict=False)

Compute a maximum matching between reference and estimated notes, only taking note onsets into account.

Given two note sequences represented by ref_intervals and est_intervals (see mir_eval.io.load_valued_intervals()), we seek the largest set of correspondences (i, j) such that the onset of reference note i is within onset_tolerance of the onset of estimated note j.

Every reference note is matched against at most one estimated note.

Note there are separate functions match_note_offsets() and match_notes() for matching notes based on offsets only or based on onset, offset, and pitch, respectively. This is because the rules for matching note onsets and matching note offsets are different.

Parameters
ref_intervalsnp.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

est_intervalsnp.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

onset_tolerancefloat > 0

The tolerance for an estimated note’s onset deviating from the reference note’s onset, in seconds. Default is 0.05 (50 ms).

strictbool

If strict=False (the default), threshold checks for onset matching are performed using <= (less than or equal). If strict=True, the threshold checks are performed using < (less than).

Returns
matchinglist of tuples

A list of matched reference and estimated notes. matching[i] == (i, j) where reference note i matches estimated note j.

mir_eval.transcription.match_notes(ref_intervals, ref_pitches, est_intervals, est_pitches, onset_tolerance=0.05, pitch_tolerance=50.0, offset_ratio=0.2, offset_min_tolerance=0.05, strict=False)

Compute a maximum matching between reference and estimated notes, subject to onset, pitch and (optionally) offset constraints.

Given two note sequences represented by ref_intervals, ref_pitches, est_intervals and est_pitches (see mir_eval.io.load_valued_intervals()), we seek the largest set of correspondences (i, j) such that:

  1. The onset of reference note i is within onset_tolerance of the onset of estimated note j.

  2. The pitch of reference note i is within pitch_tolerance of the pitch of estimated note j.

  3. If offset_ratio is not None, the offset of reference note i has to be within offset_tolerance of the offset of estimated note j, where offset_tolerance is equal to offset_ratio times the reference note’s duration, i.e. offset_ratio * ref_duration[i] where ref_duration[i] = ref_intervals[i, 1] - ref_intervals[i, 0]. If the resulting offset_tolerance is less than 0.05 (50 ms), 0.05 is used instead.

  4. If offset_ratio is None, note offsets are ignored, and only criteria 1 and 2 are taken into consideration.

Every reference note is matched against at most one estimated note.

This is useful for computing precision/recall metrics for note transcription.

Note there are separate functions match_note_onsets() and match_note_offsets() for matching notes based on onsets only or based on offsets only, respectively.

Parameters
ref_intervalsnp.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

ref_pitchesnp.ndarray, shape=(n,)

Array of reference pitch values in Hertz

est_intervalsnp.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

est_pitchesnp.ndarray, shape=(m,)

Array of estimated pitch values in Hertz

onset_tolerancefloat > 0

The tolerance for an estimated note’s onset deviating from the reference note’s onset, in seconds. Default is 0.05 (50 ms).

pitch_tolerancefloat > 0

The tolerance for an estimated note’s pitch deviating from the reference note’s pitch, in cents. Default is 50.0 (50 cents).

offset_ratiofloat > 0 or None

The ratio of the reference note’s duration used to define the offset_tolerance. Default is 0.2 (20%), meaning the offset_tolerance will equal the ref_duration * 0.2, or 0.05 (50 ms), whichever is greater. If offset_ratio is set to None, offsets are ignored in the matching.

offset_min_tolerancefloat > 0

The minimum tolerance for offset matching. See offset_ratio description for an explanation of how the offset tolerance is determined. Note: this parameter only influences the results if offset_ratio is not None.

strictbool

If strict=False (the default), threshold checks for onset, offset, and pitch matching are performed using <= (less than or equal). If strict=True, the threshold checks are performed using < (less than).

Returns
matchinglist of tuples

A list of matched reference and estimated notes. matching[i] == (i, j) where reference note i matches estimated note j.
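
As a usage sketch (file names are placeholders), a full onset/pitch/offset matching can be computed with the default tolerances as:

>>> ref_intervals, ref_pitches = mir_eval.io.load_valued_intervals('reference.txt')
>>> est_intervals, est_pitches = mir_eval.io.load_valued_intervals('estimated.txt')
>>> matching = mir_eval.transcription.match_notes(
...     ref_intervals, ref_pitches, est_intervals, est_pitches)
>>> # each pair (i, j) indicates that reference note i was matched to estimated note j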

mir_eval.transcription.precision_recall_f1_overlap(ref_intervals, ref_pitches, est_intervals, est_pitches, onset_tolerance=0.05, pitch_tolerance=50.0, offset_ratio=0.2, offset_min_tolerance=0.05, strict=False, beta=1.0)

Compute the Precision, Recall and F-measure of correct vs incorrectly transcribed notes, and the Average Overlap Ratio for correctly transcribed notes (see average_overlap_ratio()). “Correctness” is determined based on note onset, pitch and (optionally) offset: an estimated note is assumed correct if its onset is within +-50ms of a reference note and its pitch (F0) is within +- quarter tone (50 cents) of the corresponding reference note. If offset_ratio is None, note offsets are ignored in the comparison. Otherwise, on top of the above requirements, a correct returned note is required to have an offset value within 20% (by default, adjustable via the offset_ratio parameter) of the reference note’s duration around the reference note’s offset, or within offset_min_tolerance (50 ms by default), whichever is larger.

Parameters
ref_intervalsnp.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

ref_pitchesnp.ndarray, shape=(n,)

Array of reference pitch values in Hertz

est_intervalsnp.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

est_pitchesnp.ndarray, shape=(m,)

Array of estimated pitch values in Hertz

onset_tolerancefloat > 0

The tolerance for an estimated note’s onset deviating from the reference note’s onset, in seconds. Default is 0.05 (50 ms).

pitch_tolerancefloat > 0

The tolerance for an estimated note’s pitch deviating from the reference note’s pitch, in cents. Default is 50.0 (50 cents).

offset_ratiofloat > 0 or None

The ratio of the reference note’s duration used to define the offset_tolerance. Default is 0.2 (20%), meaning the offset_tolerance will equal the ref_duration * 0.2, or offset_min_tolerance (0.05 by default, i.e. 50 ms), whichever is greater. If offset_ratio is set to None, offsets are ignored in the evaluation.

offset_min_tolerancefloat > 0

The minimum tolerance for offset matching. See offset_ratio description for an explanation of how the offset tolerance is determined. Note: this parameter only influences the results if offset_ratio is not None.

strictbool

If strict=False (the default), threshold checks for onset, offset, and pitch matching are performed using <= (less than or equal). If strict=True, the threshold checks are performed using < (less than).

betafloat > 0

Weighting factor for f-measure (default value = 1.0).

Returns
precisionfloat

The computed precision score

recallfloat

The computed recall score

f_measurefloat

The computed F-measure score

avg_overlap_ratiofloat

The computed Average Overlap Ratio score

Examples

>>> ref_intervals, ref_pitches = mir_eval.io.load_valued_intervals(
...     'reference.txt')
>>> est_intervals, est_pitches = mir_eval.io.load_valued_intervals(
...     'estimated.txt')
>>> (precision,
...  recall,
...  f_measure) = mir_eval.transcription.precision_recall_f1_overlap(
...      ref_intervals, ref_pitches, est_intervals, est_pitches)
>>> (precision_no_offset,
...  recall_no_offset,
...  f_measure_no_offset) = (
...      mir_eval.transcription.precision_recall_f1_overlap(
...          ref_intervals, ref_pitches, est_intervals, est_pitches,
...          offset_ratio=None))
mir_eval.transcription.average_overlap_ratio(ref_intervals, est_intervals, matching)

Compute the Average Overlap Ratio between a reference and estimated note transcription. Given a reference and corresponding estimated note, their overlap ratio (OR) is defined as the ratio between the duration of the time segment in which the two notes overlap and the time segment spanned by the two notes combined (earliest onset to latest offset):

>>> OR = ((min(ref_offset, est_offset) - max(ref_onset, est_onset)) /
...     (max(ref_offset, est_offset) - min(ref_onset, est_onset)))

The Average Overlap Ratio (AOR) is given by the mean OR computed over all matching reference and estimated notes. The metric goes from 0 (worst) to 1 (best).

Note: this function assumes the matching of reference and estimated notes (see match_notes()) has already been performed and is provided by the matching parameter. Furthermore, it is highly recommended to validate the intervals (see validate_intervals()) before calling this function, otherwise it is possible (though unlikely) for this function to attempt a divide-by-zero operation.

Parameters
ref_intervalsnp.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

est_intervalsnp.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

matchinglist of tuples

A list of matched reference and estimated notes. matching[i] == (i, j) where reference note i matches estimated note j.

Returns
avg_overlap_ratiofloat

The computed Average Overlap Ratio score
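
For example, the AOR is typically computed on a matching produced by match_notes() (reusing the arrays from the sketch above):

>>> matching = mir_eval.transcription.match_notes(
...     ref_intervals, ref_pitches, est_intervals, est_pitches)
>>> avg_overlap_ratio = mir_eval.transcription.average_overlap_ratio(
...     ref_intervals, est_intervals, matching)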

mir_eval.transcription.onset_precision_recall_f1(ref_intervals, est_intervals, onset_tolerance=0.05, strict=False, beta=1.0)

Compute the Precision, Recall and F-measure of note onsets: an estimated onset is considered correct if it is within +-50ms of a reference onset. Note that this metric completely ignores note offset and note pitch. This means an estimated onset will be considered correct if it matches a reference onset, even if the onsets come from notes with completely different pitches (i.e. notes that would not match with match_notes()).

Parameters
ref_intervalsnp.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

est_intervalsnp.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

onset_tolerancefloat > 0

The tolerance for an estimated note’s onset deviating from the reference note’s onset, in seconds. Default is 0.05 (50 ms).

strictbool

If strict=False (the default), threshold checks for onset matching are performed using <= (less than or equal). If strict=True, the threshold checks are performed using < (less than).

betafloat > 0

Weighting factor for f-measure (default value = 1.0).

Returns
precisionfloat

The computed precision score

recallfloat

The computed recall score

f_measurefloat

The computed F-measure score

Examples

>>> ref_intervals, _ = mir_eval.io.load_valued_intervals(
...     'reference.txt')
>>> est_intervals, _ = mir_eval.io.load_valued_intervals(
...     'estimated.txt')
>>> (onset_precision,
...  onset_recall,
...  onset_f_measure) = mir_eval.transcription.onset_precision_recall_f1(
...      ref_intervals, est_intervals)
mir_eval.transcription.offset_precision_recall_f1(ref_intervals, est_intervals, offset_ratio=0.2, offset_min_tolerance=0.05, strict=False, beta=1.0)

Compute the Precision, Recall and F-measure of note offsets: an estimated offset is considered correct if it is within +-50ms (or 20% of the reference note’s duration, whichever is greater) of a reference offset. Note that this metric completely ignores note onsets and note pitch. This means an estimated offset will be considered correct if it matches a reference offset, even if the offsets come from notes with completely different pitches (i.e. notes that would not match with match_notes()).

Parameters
ref_intervalsnp.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

est_intervalsnp.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

offset_ratiofloat > 0 or None

The ratio of the reference note’s duration used to define the offset_tolerance. Default is 0.2 (20%), meaning the offset_tolerance will equal the ref_duration * 0.2, or offset_min_tolerance (0.05 by default, i.e. 50 ms), whichever is greater.

offset_min_tolerancefloat > 0

The minimum tolerance for offset matching. See offset_ratio description for an explanation of how the offset tolerance is determined.

strictbool

If strict=False (the default), threshold checks for offset matching are performed using <= (less than or equal). If strict=True, the threshold checks are performed using < (less than).

betafloat > 0

Weighting factor for f-measure (default value = 1.0).

Returns
precisionfloat

The computed precision score

recallfloat

The computed recall score

f_measurefloat

The computed F-measure score

Examples

>>> ref_intervals, _ = mir_eval.io.load_valued_intervals(
...     'reference.txt')
>>> est_intervals, _ = mir_eval.io.load_valued_intervals(
...     'estimated.txt')
>>> (offset_precision,
...  offset_recall,
...  offset_f_measure) = mir_eval.transcription.offset_precision_recall_f1(
...      ref_intervals, est_intervals)
mir_eval.transcription.evaluate(ref_intervals, ref_pitches, est_intervals, est_pitches, **kwargs)

Compute all metrics for the given reference and estimated annotations.

Parameters
ref_intervalsnp.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

ref_pitchesnp.ndarray, shape=(n,)

Array of reference pitch values in Hertz

est_intervalsnp.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

est_pitchesnp.ndarray, shape=(m,)

Array of estimated pitch values in Hertz

kwargs

Additional keyword arguments which will be passed to the appropriate metric or preprocessing functions.

Returns
scoresdict

Dictionary of scores, where the key is the metric name (str) and the value is the (float) score achieved.

Examples

>>> ref_intervals, ref_pitches = mir_eval.io.load_valued_intervals(
...    'reference.txt')
>>> est_intervals, est_pitches = mir_eval.io.load_valued_intervals(
...    'estimate.txt')
>>> scores = mir_eval.transcription.evaluate(ref_intervals, ref_pitches,
...     est_intervals, est_pitches)

mir_eval.transcription_velocity

Transcription evaluation, as defined in mir_eval.transcription, does not take into account the velocities of reference and estimated notes. This submodule implements a variant of mir_eval.transcription.precision_recall_f1_overlap() which additionally considers note velocity when determining whether a note is correctly transcribed. This is done by defining a new function mir_eval.transcription_velocity.match_notes() which first calls mir_eval.transcription.match_notes() to get a note matching based on onset, offset, and pitch. Then, we follow the evaluation procedure described in 20 to test whether an estimated note should be considered correct:

  1. Reference velocities are re-scaled to the range [0, 1].

  2. A linear regression is performed to estimate global scale and offset parameters which minimize the L2 distance between matched estimated and (rescaled) reference notes.

  3. The scale and offset parameters are used to rescale estimated velocities.

  4. An estimated/reference note pair which has been matched according to the onset, offset, and pitch is further only considered correct if the rescaled velocities are within a predefined threshold, defaulting to 0.1.

mir_eval.transcription_velocity.match_notes() is used to define a new variant mir_eval.transcription_velocity.precision_recall_f1_overlap() which considers velocity.

Conventions

This submodule follows the conventions of mir_eval.transcription and additionally requires velocities to be provided as MIDI velocities in the range [0, 127].
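
As a toy sketch (all values made up), two reference notes with MIDI velocities and a close estimate could be evaluated as:

>>> import numpy as np
>>> ref_intervals = np.array([[0.0, 0.5], [0.6, 1.0]])
>>> ref_pitches = np.array([440.0, 220.0])
>>> ref_velocities = np.array([60.0, 100.0])
>>> est_intervals = np.array([[0.01, 0.52], [0.60, 0.99]])
>>> est_pitches = np.array([441.0, 220.0])
>>> est_velocities = np.array([64.0, 90.0])
>>> scores = mir_eval.transcription_velocity.evaluate(
...     ref_intervals, ref_pitches, ref_velocities,
...     est_intervals, est_pitches, est_velocities)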

Metrics

References

20

Curtis Hawthorne, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse Engel, Sageev Oore, and Douglas Eck, “Onsets and Frames: Dual-Objective Piano Transcription”, Proceedings of the 19th International Society for Music Information Retrieval Conference, 2018.

mir_eval.transcription_velocity.validate(ref_intervals, ref_pitches, ref_velocities, est_intervals, est_pitches, est_velocities)

Checks that the input annotations have valid time intervals, pitches, and velocities, and throws helpful errors if not.

Parameters
ref_intervalsnp.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

ref_pitchesnp.ndarray, shape=(n,)

Array of reference pitch values in Hertz

ref_velocitiesnp.ndarray, shape=(n,)

Array of MIDI velocities (i.e. between 0 and 127) of reference notes

est_intervalsnp.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

est_pitchesnp.ndarray, shape=(m,)

Array of estimated pitch values in Hertz

est_velocitiesnp.ndarray, shape=(m,)

Array of MIDI velocities (i.e. between 0 and 127) of estimated notes

mir_eval.transcription_velocity.match_notes(ref_intervals, ref_pitches, ref_velocities, est_intervals, est_pitches, est_velocities, onset_tolerance=0.05, pitch_tolerance=50.0, offset_ratio=0.2, offset_min_tolerance=0.05, strict=False, velocity_tolerance=0.1)

Match notes, taking note velocity into consideration.

This function first calls mir_eval.transcription.match_notes() to match notes according to the supplied intervals, pitches, onset, offset, and pitch tolerances. The velocities of the matched notes are then used to estimate a slope and intercept which can rescale the estimated velocities so that they are as close as possible (in the L2 sense) to their matched reference velocities. Velocities are then normalized to the range [0, 1]. An estimated note is then further only considered correct if its velocity is within velocity_tolerance of its matched (according to pitch and timing) reference note.

Parameters
ref_intervalsnp.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

ref_pitchesnp.ndarray, shape=(n,)

Array of reference pitch values in Hertz

ref_velocitiesnp.ndarray, shape=(n,)

Array of MIDI velocities (i.e. between 0 and 127) of reference notes

est_intervalsnp.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

est_pitchesnp.ndarray, shape=(m,)

Array of estimated pitch values in Hertz

est_velocitiesnp.ndarray, shape=(m,)

Array of MIDI velocities (i.e. between 0 and 127) of estimated notes

onset_tolerancefloat > 0

The tolerance for an estimated note’s onset deviating from the reference note’s onset, in seconds. Default is 0.05 (50 ms).

pitch_tolerancefloat > 0

The tolerance for an estimated note’s pitch deviating from the reference note’s pitch, in cents. Default is 50.0 (50 cents).

offset_ratiofloat > 0 or None

The ratio of the reference note’s duration used to define the offset_tolerance. Default is 0.2 (20%), meaning the offset_tolerance will equal the ref_duration * 0.2, or 0.05 (50 ms), whichever is greater. If offset_ratio is set to None, offsets are ignored in the matching.

offset_min_tolerancefloat > 0

The minimum tolerance for offset matching. See offset_ratio description for an explanation of how the offset tolerance is determined. Note: this parameter only influences the results if offset_ratio is not None.

strictbool

If strict=False (the default), threshold checks for onset, offset, and pitch matching are performed using <= (less than or equal). If strict=True, the threshold checks are performed using < (less than).

velocity_tolerancefloat > 0

Estimated notes are considered correct if, after rescaling and normalization to [0, 1], they are within velocity_tolerance of a matched reference note.

Returns
matchinglist of tuples

A list of matched reference and estimated notes. matching[i] == (i, j) where reference note i matches estimated note j.

mir_eval.transcription_velocity.precision_recall_f1_overlap(ref_intervals, ref_pitches, ref_velocities, est_intervals, est_pitches, est_velocities, onset_tolerance=0.05, pitch_tolerance=50.0, offset_ratio=0.2, offset_min_tolerance=0.05, strict=False, velocity_tolerance=0.1, beta=1.0)

Compute the Precision, Recall and F-measure of correct vs incorrectly transcribed notes, and the Average Overlap Ratio for correctly transcribed notes (see mir_eval.transcription.average_overlap_ratio()). “Correctness” is determined based on note onset, velocity, pitch and (optionally) offset. An estimated note is considered correct if

  1. Its onset is within onset_tolerance (default +-50ms) of a reference note

  2. Its pitch (F0) is within +/- pitch_tolerance (default one quarter tone, 50 cents) of the corresponding reference note

  3. Its velocity, after normalizing reference velocities to the range [0, 1] and globally rescaling estimated velocities to minimize the L2 distance between matched reference notes, is within velocity_tolerance (default 0.1) of the corresponding reference note

  4. If offset_ratio is None, note offsets are ignored in the comparison. Otherwise, on top of the above requirements, a correct returned note is required to have an offset value within offset_ratio (default 20%) of the reference note’s duration around the reference note’s offset, or within offset_min_tolerance (default 50 ms), whichever is larger.

Parameters
ref_intervalsnp.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

ref_pitchesnp.ndarray, shape=(n,)

Array of reference pitch values in Hertz

ref_velocitiesnp.ndarray, shape=(n,)

Array of MIDI velocities (i.e. between 0 and 127) of reference notes

est_intervalsnp.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

est_pitchesnp.ndarray, shape=(m,)

Array of estimated pitch values in Hertz

est_velocitiesnp.ndarray, shape=(m,)

Array of MIDI velocities (i.e. between 0 and 127) of estimated notes

onset_tolerancefloat > 0

The tolerance for an estimated note’s onset deviating from the reference note’s onset, in seconds. Default is 0.05 (50 ms).

pitch_tolerancefloat > 0

The tolerance for an estimated note’s pitch deviating from the reference note’s pitch, in cents. Default is 50.0 (50 cents).

offset_ratiofloat > 0 or None

The ratio of the reference note’s duration used to define the offset_tolerance. Default is 0.2 (20%), meaning the offset_tolerance will equal the ref_duration * 0.2, or offset_min_tolerance (0.05 by default, i.e. 50 ms), whichever is greater. If offset_ratio is set to None, offsets are ignored in the evaluation.

offset_min_tolerancefloat > 0

The minimum tolerance for offset matching. See offset_ratio description for an explanation of how the offset tolerance is determined. Note: this parameter only influences the results if offset_ratio is not None.

strictbool

If strict=False (the default), threshold checks for onset, offset, and pitch matching are performed using <= (less than or equal). If strict=True, the threshold checks are performed using < (less than).

velocity_tolerancefloat > 0

Estimated notes are considered correct if, after rescaling and normalization to [0, 1], they are within velocity_tolerance of a matched reference note.

betafloat > 0

Weighting factor for f-measure (default value = 1.0).

Returns
precisionfloat

The computed precision score

recallfloat

The computed recall score

f_measurefloat

The computed F-measure score

avg_overlap_ratiofloat

The computed Average Overlap Ratio score

mir_eval.transcription_velocity.evaluate(ref_intervals, ref_pitches, ref_velocities, est_intervals, est_pitches, est_velocities, **kwargs)

Compute all metrics for the given reference and estimated annotations.

Parameters
ref_intervalsnp.ndarray, shape=(n,2)

Array of reference notes time intervals (onset and offset times)

ref_pitchesnp.ndarray, shape=(n,)

Array of reference pitch values in Hertz

ref_velocitiesnp.ndarray, shape=(n,)

Array of MIDI velocities (i.e. between 0 and 127) of reference notes

est_intervalsnp.ndarray, shape=(m,2)

Array of estimated notes time intervals (onset and offset times)

est_pitchesnp.ndarray, shape=(m,)

Array of estimated pitch values in Hertz

est_velocitiesnp.ndarray, shape=(m,)

Array of MIDI velocities (i.e. between 0 and 127) of estimated notes

kwargs

Additional keyword arguments which will be passed to the appropriate metric or preprocessing functions.

Returns
scoresdict

Dictionary of scores, where the key is the metric name (str) and the value is the (float) score achieved.

mir_eval.key

Key Detection involves determining the underlying key (distribution of notes and note transitions) in a piece of music. Key detection algorithms are evaluated by comparing their estimated key to a ground-truth reference key and reporting a score according to the relationship of the keys.

Conventions

Keys are represented as strings of the form '(key) (mode)', e.g. 'C# major' or 'Fb minor'. The case of the key is ignored. Note that certain key strings are equivalent, e.g. 'C# major' and 'Db major'. The mode may only be specified as either 'major' or 'minor', no other mode strings will be accepted.

Metrics

mir_eval.key.validate_key(key)

Checks that a key is well-formatted, e.g. in the form 'C# major'. The key can be 'X' if it is not possible to categorize the key, and the mode can be 'other' if it cannot be categorized as major or minor.

Parameters
keystr

Key to verify

mir_eval.key.validate(reference_key, estimated_key)

Checks that the input annotations to a metric are valid key strings and throws helpful errors if not.

Parameters
reference_keystr

Reference key string.

estimated_keystr

Estimated key string.

mir_eval.key.split_key_string(key)

Splits a key string (of the form, e.g. 'C# major'), into a tuple of (key, mode) where key is an integer representing the semitone distance from C.

Parameters
keystr

String representing a key.

Returns
keyint

Number of semitones above C.

modestr

String representing the mode.
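
For instance, 'C#' lies one semitone above C, so the call below should return key 1 and mode 'major':

>>> key, mode = mir_eval.key.split_key_string('C# major')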

mir_eval.key.weighted_score(reference_key, estimated_key)

Computes a heuristic score which is weighted according to the relationship of the reference and estimated key, as follows:

Relationship                                             Score
Same key and mode                                        1.0
Estimated key is a perfect fifth above reference key    0.5
Relative major/minor (same key signature)               0.3
Parallel major/minor (same key)                         0.2
Other                                                   0.0

Parameters
reference_keystr

Reference key string.

estimated_keystr

Estimated key string.

Returns
scorefloat

Score representing how closely related the keys are.

Examples

>>> ref_key = mir_eval.io.load_key('ref.txt')
>>> est_key = mir_eval.io.load_key('est.txt')
>>> score = mir_eval.key.weighted_score(ref_key, est_key)
mir_eval.key.evaluate(reference_key, estimated_key, **kwargs)

Compute all metrics for the given reference and estimated annotations.

Parameters
reference_keystr

Reference key string.

estimated_keystr

Estimated key string.

kwargs

Additional keyword arguments which will be passed to the appropriate metric or preprocessing functions.

Returns
scoresdict

Dictionary of scores, where the key is the metric name (str) and the value is the (float) score achieved.

Examples

>>> ref_key = mir_eval.io.load_key('reference.txt')
>>> est_key = mir_eval.io.load_key('estimated.txt')
>>> scores = mir_eval.key.evaluate(ref_key, est_key)

mir_eval.util

This submodule collects useful functionality required across the task submodules, such as preprocessing, validation, and common computations.

mir_eval.util.index_labels(labels, case_sensitive=False)

Convert a list of string identifiers into numerical indices.

Parameters
labelslist of strings, shape=(n,)

A list of annotations, e.g., segment or chord labels from an annotation file.

case_sensitivebool

Set to True to enable case-sensitive label indexing (Default value = False)

Returns
indiceslist, shape=(n,)

Numerical representation of labels

index_to_labeldict

Mapping to convert numerical indices back to labels. labels[i] == index_to_label[indices[i]]

mir_eval.util.generate_labels(items, prefix='__')

Given an array of items (e.g. events, intervals), create a synthetic label for each event of the form ‘(label prefix)(item number)’

Parameters
itemslist-like

A list or array of events or intervals

prefixstr

This prefix will be prepended to all synthetically generated labels (Default value = ‘__’)

Returns
labelslist of str

Synthetically generated labels

mir_eval.util.intervals_to_samples(intervals, labels, offset=0, sample_size=0.1, fill_value=None)

Convert an array of labeled time intervals to annotated samples.

Parameters
intervalsnp.ndarray, shape=(n, d)

An array of time intervals, as returned by mir_eval.io.load_intervals() or mir_eval.io.load_labeled_intervals(). The i th interval spans time intervals[i, 0] to intervals[i, 1].

labelslist, shape=(n,)

The annotation for each interval

offsetfloat > 0

Phase offset of the sampled time grid (in seconds) (Default value = 0)

sample_sizefloat > 0

duration of each sample to be generated (in seconds) (Default value = 0.1)

fill_valuetype(labels[0])

Object to use for the label with out-of-range time points. (Default value = None)

Returns
sample_timeslist

list of sample times

sample_labelslist

array of labels for each generated sample

Notes

Intervals will be rounded down to the nearest multiple of sample_size.
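
A minimal sketch with two toy intervals sampled on a 0.1 s grid (sample_times is the uniform time grid and sample_labels holds the label active at each sample time):

>>> import numpy as np
>>> sample_times, sample_labels = mir_eval.util.intervals_to_samples(
...     np.array([[0.0, 0.2], [0.2, 0.5]]), ['A', 'B'], sample_size=0.1)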

mir_eval.util.interpolate_intervals(intervals, labels, time_points, fill_value=None)

Assign labels to a set of points in time given a set of intervals.

Time points that do not lie within an interval are mapped to fill_value.

Parameters
intervalsnp.ndarray, shape=(n, 2)

An array of time intervals, as returned by mir_eval.io.load_intervals(). The i th interval spans time intervals[i, 0] to intervals[i, 1].

Intervals are assumed to be disjoint.

labelslist, shape=(n,)

The annotation for each interval

time_pointsarray_like, shape=(m,)

Points in time to assign labels. These must be in non-decreasing order.

fill_valuetype(labels[0])

Object to use for the label with out-of-range time points. (Default value = None)

Returns
aligned_labelslist

Labels corresponding to the given time points.

Raises
ValueError

If time_points is not in non-decreasing order.

mir_eval.util.sort_labeled_intervals(intervals, labels=None)

Sort intervals, and optionally, their corresponding labels according to start time.

Parameters
intervalsnp.ndarray, shape=(n, 2)

The input intervals

labelslist, optional

Labels for each interval

Returns
intervals_sorted or (intervals_sorted, labels_sorted)

Labels are only returned if provided as input

mir_eval.util.f_measure(precision, recall, beta=1.0)

Compute the f-measure from precision and recall scores.

Parameters
precisionfloat in (0, 1]

Precision

recallfloat in (0, 1]

Recall

betafloat > 0

Weighting factor for f-measure (Default value = 1.0)

Returns
f_measurefloat

The weighted f-measure
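
For example, with precision 0.8 and recall 0.6 the balanced (beta = 1) F-measure is 2 * 0.8 * 0.6 / (0.8 + 0.6), roughly 0.686:

>>> f = mir_eval.util.f_measure(0.8, 0.6)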

mir_eval.util.intervals_to_boundaries(intervals, q=5)

Convert interval times into boundaries.

Parameters
intervalsnp.ndarray, shape=(n_events, 2)

Array of interval start and end-times

qint

Number of decimals to round to. (Default value = 5)

Returns
boundariesnp.ndarray

Interval boundary times, including the end of the final interval

mir_eval.util.boundaries_to_intervals(boundaries)

Convert an array of event times into intervals

Parameters
boundarieslist-like

List-like of event times. These are assumed to be unique timestamps in ascending order.

Returns
intervalsnp.ndarray, shape=(n_intervals, 2)

Start and end time for each interval

mir_eval.util.adjust_intervals(intervals, labels=None, t_min=0.0, t_max=None, start_label='__T_MIN', end_label='__T_MAX')

Adjust a list of time intervals to span the range [t_min, t_max].

Any intervals lying completely outside the specified range will be removed.

Any intervals lying partially outside the specified range will be cropped.

If the specified range exceeds the span of the provided data in either direction, additional intervals will be appended. If an interval is appended at the beginning, it will be given the label start_label; if an interval is appended at the end, it will be given the label end_label.

Parameters
intervalsnp.ndarray, shape=(n_events, 2)

Array of interval start and end-times

labelslist, len=n_events or None

List of labels (Default value = None)

t_minfloat or None

Minimum interval start time. (Default value = 0.0)

t_maxfloat or None

Maximum interval end time. (Default value = None)

start_labelstr or float or int

Label to give any intervals appended at the beginning (Default value = ‘__T_MIN’)

end_labelstr or float or int

Label to give any intervals appended at the end (Default value = ‘__T_MAX’)

Returns
new_intervalsnp.ndarray

Intervals spanning [t_min, t_max]

new_labelslist

List of labels for new_intervals
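
Example (a minimal sketch with hypothetical toy data; the expected output is
described in comments):

>>> import numpy as np
>>> intervals = np.array([[2.0, 5.0], [5.0, 8.0]])
>>> labels = ['intro', 'verse']
>>> new_intervals, new_labels = mir_eval.util.adjust_intervals(
...     intervals, labels, t_min=0.0, t_max=10.0)
>>> # new_intervals should span [0, 10]:
>>> # [[0, 2], [2, 5], [5, 8], [8, 10]]
>>> # new_labels should be ['__T_MIN', 'intro', 'verse', '__T_MAX']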

mir_eval.util.adjust_events(events, labels=None, t_min=0.0, t_max=None, label_prefix='__')

Adjust the given list of event times to span the range [t_min, t_max].

Any event times outside of the specified range will be removed.

If the times do not span [t_min, t_max], additional events will be added with the prefix label_prefix.

Parameters
eventsnp.ndarray

Array of event times (seconds)

labelslist or None

List of labels (Default value = None)

t_minfloat or None

Minimum valid event time. (Default value = 0.0)

t_maxfloat or None

Maximum valid event time. (Default value = None)

label_prefixstr

Prefix string to use for synthetic labels (Default value = ‘__’)

Returns
new_timesnp.ndarray

Event times corrected to the given range.

mir_eval.util.intersect_files(flist1, flist2)

Return the intersection of two sets of filepaths, based on the file name (after the final ‘/’) and ignoring the file extension.

Parameters
flist1list

first list of filepaths

flist2list

second list of filepaths

Returns
sublist1list

subset of filepaths with matching stems from flist1

sublist2list

corresponding filepaths from flist2

Examples

>>> flist1 = ['/a/b/abc.lab', '/c/d/123.lab', '/e/f/xyz.lab']
>>> flist2 = ['/g/h/xyz.npy', '/i/j/123.txt', '/k/l/456.lab']
>>> sublist1, sublist2 = mir_eval.util.intersect_files(flist1, flist2)
>>> print(sublist1)
['/e/f/xyz.lab', '/c/d/123.lab']
>>> print(sublist2)
['/g/h/xyz.npy', '/i/j/123.txt']

mir_eval.util.merge_labeled_intervals(x_intervals, x_labels, y_intervals, y_labels)

Merge the time intervals of two sequences.

Parameters
x_intervalsnp.ndarray

Array of interval times (seconds)

x_labelslist or None

List of labels

y_intervalsnp.ndarray

Array of interval times (seconds)

y_labelslist or None

List of labels

Returns
new_intervalsnp.ndarray

New interval times of the merged sequences.

new_x_labelslist

New labels for the sequence x

new_y_labelslist

New labels for the sequence y
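
Example (a minimal sketch with hypothetical toy data; both sequences are assumed
here to span the same time range, and the merged grid is the union of their
boundaries):

>>> import numpy as np
>>> x_intervals = np.array([[0.0, 2.0], [2.0, 4.0]])
>>> y_intervals = np.array([[0.0, 1.0], [1.0, 4.0]])
>>> merged, x_labels, y_labels = mir_eval.util.merge_labeled_intervals(
...     x_intervals, ['a', 'b'], y_intervals, ['c', 'd'])
>>> # merged should be [[0, 1], [1, 2], [2, 4]], with
>>> # x_labels == ['a', 'a', 'b'] and y_labels == ['c', 'd', 'd']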

mir_eval.util.match_events(ref, est, window, distance=None)

Compute a maximum matching between reference and estimated event times, subject to a window constraint.

Given two lists of event times ref and est, we seek the largest set of correspondences (ref[i], est[j]) such that distance(ref[i], est[j]) <= window, and each ref[i] and est[j] is matched at most once.

This is useful for computing precision/recall metrics in beat tracking, onset detection, and segmentation.

Parameters
refnp.ndarray, shape=(n,)

Array of reference values

estnp.ndarray, shape=(m,)

Array of estimated values

windowfloat > 0

Size of the window.

distancefunction

function that computes the outer distance of ref and est. By default uses |ref[i] - est[j]|

Returns
matchinglist of tuples

A list of matched reference and estimated event indices. Each element (i, j) indicates that ref[i] was matched to est[j].
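
Example (a minimal sketch with hypothetical event times):

>>> import numpy as np
>>> ref = np.array([0.0, 1.0, 2.0])
>>> est = np.array([0.05, 1.12, 1.95])
>>> matching = mir_eval.util.match_events(ref, est, window=0.1)
>>> # Only ref[0]/est[0] and ref[2]/est[2] are within 0.1 s of each other,
>>> # so matching should contain the pairs (0, 0) and (2, 2)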

mir_eval.util.validate_intervals(intervals)

Checks that an (n, 2) interval ndarray is well-formed, and raises errors if not.

Parameters
intervalsnp.ndarray, shape=(n, 2)

Array of interval start/end locations.

mir_eval.util.validate_events(events, max_time=30000.0)

Checks that a 1-d event location ndarray is well-formed, and raises errors if not.

Parameters
eventsnp.ndarray, shape=(n,)

Array of event times

max_timefloat

If an event is found above this time, a ValueError will be raised. (Default value = 30000.)

mir_eval.util.validate_frequencies(frequencies, max_freq, min_freq, allow_negatives=False)

Checks that a 1-d frequency ndarray is well-formed, and raises errors if not.

Parameters
frequenciesnp.ndarray, shape=(n,)

Array of frequency values

max_freqfloat

If a frequency is found above this value, a ValueError will be raised.

min_freqfloat

If a frequency is found below this value, a ValueError will be raised.

allow_negativesbool

Whether or not to allow negative frequency values.

mir_eval.util.has_kwargs(function)

Determine whether a function has **kwargs.

Parameters
functioncallable

The function to test

Returns
True if function accepts arbitrary keyword arguments.
False otherwise.
mir_eval.util.filter_kwargs(_function, *args, **kwargs)

Given a function and args and keyword args to pass to it, call the function but using only the keyword arguments which it accepts. This is equivalent to redefining the function with an additional **kwargs to accept slop keyword args.

If the target function already accepts **kwargs parameters, no filtering is performed.

Parameters
_functioncallable

Function to call. Can take in any number of args or kwargs
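
Example (a minimal sketch with a hypothetical helper function):

>>> def add(a, b=1):
...     return a + b
>>> # 'c' is not a parameter of add, so it is filtered out before the call
>>> mir_eval.util.filter_kwargs(add, 2, b=3, c=10)
5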

mir_eval.util.intervals_to_durations(intervals)

Converts an array of n intervals to their n durations.

Parameters
intervalsnp.ndarray, shape=(n, 2)

An array of time intervals, as returned by mir_eval.io.load_intervals(). The i th interval spans time intervals[i, 0] to intervals[i, 1].

Returns
durationsnp.ndarray, shape=(n,)

Array of the duration of each interval.

mir_eval.util.hz_to_midi(freqs)

Convert Hz to MIDI numbers

Parameters
freqsnumber or ndarray

Frequency/frequencies in Hz

Returns
midinumber or ndarray

MIDI note numbers corresponding to input frequencies. Note that these may be fractional.

mir_eval.util.midi_to_hz(midi)

Convert MIDI numbers to Hz

Parameters
midinumber or ndarray

MIDI notes

Returns
freqsnumber or ndarray

Frequency/frequencies in Hz corresponding to midi
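
Example (a minimal sketch, assuming the standard
midi = 69 + 12 * log2(frequency / 440) convention):

>>> mir_eval.util.hz_to_midi(440.0)   # A4 -> 69.0
>>> mir_eval.util.midi_to_hz(60)      # middle C -> approximately 261.63 Hz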

mir_eval.io

Functions for loading in annotations from files in different formats.

mir_eval.io.load_delimited(filename, converters, delimiter='\\s+', comment='#')

Utility function for loading in data from an annotation file where columns are delimited. The number of columns is inferred from the length of the provided converters list.

Parameters
filenamestr

Path to the annotation file

converterslist of functions

Each entry in column n of the file will be cast by the function converters[n].

delimiterstr

Separator regular expression. By default, lines will be split by any amount of whitespace.

commentstr or None

Comment regular expression. Any lines beginning with this string or pattern will be ignored.

Setting to None disables comments.

Returns
columnstuple of lists

Each list in this tuple corresponds to values in one of the columns in the file.

Examples

>>> # Load in a one-column list of event times (floats)
>>> load_delimited('events.txt', [float])
>>> # Load in a list of labeled events, separated by commas
>>> load_delimited('labeled_events.csv', [float, str], ',')
mir_eval.io.load_events(filename, delimiter='\\s+', comment='#')

Import time-stamp events from an annotation file. The file should consist of a single column of numeric values corresponding to the event times. This is primarily useful for processing events which lack duration, such as beats or onsets.

Parameters
filenamestr

Path to the annotation file

delimiterstr

Separator regular expression. By default, lines will be split by any amount of whitespace.

commentstr or None

Comment regular expression. Any lines beginning with this string or pattern will be ignored.

Setting to None disables comments.

Returns
event_timesnp.ndarray

array of event times (float)

mir_eval.io.load_labeled_events(filename, delimiter='\\s+', comment='#')

Import labeled time-stamp events from an annotation file. The file should consist of two columns; the first having numeric values corresponding to the event times and the second having string labels for each event. This is primarily useful for processing labeled events which lack duration, such as beats with metric beat number or onsets with an instrument label.

Parameters
filenamestr

Path to the annotation file

delimiterstr

Separator regular expression. By default, lines will be split by any amount of whitespace.

commentstr or None

Comment regular expression. Any lines beginning with this string or pattern will be ignored.

Setting to None disables comments.

Returns
event_timesnp.ndarray

array of event times (float)

labelslist of str

list of labels

mir_eval.io.load_intervals(filename, delimiter='\\s+', comment='#')

Import intervals from an annotation file. The file should consist of two columns of numeric values corresponding to start and end time of each interval. This is primarily useful for processing events which span a duration, such as segmentation, chords, or instrument activation.

Parameters
filenamestr

Path to the annotation file

delimiterstr

Separator regular expression. By default, lines will be split by any amount of whitespace.

commentstr or None

Comment regular expression. Any lines beginning with this string or pattern will be ignored.

Setting to None disables comments.

Returns
intervalsnp.ndarray, shape=(n_events, 2)

array of event start and end times

mir_eval.io.load_labeled_intervals(filename, delimiter='\\s+', comment='#')

Import labeled intervals from an annotation file. The file should consist of three columns: Two consisting of numeric values corresponding to start and end time of each interval and a third corresponding to the label of each interval. This is primarily useful for processing events which span a duration, such as segmentation, chords, or instrument activation.

Parameters
filenamestr

Path to the annotation file

delimiterstr

Separator regular expression. By default, lines will be split by any amount of whitespace.

commentstr or None

Comment regular expression. Any lines beginning with this string or pattern will be ignored.

Setting to None disables comments.

Returns
intervalsnp.ndarray, shape=(n_events, 2)

array of event start and end time

labelslist of str

list of labels
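
Example (a minimal sketch; 'ref_chords.lab' is a hypothetical file whose rows
look like "0.000  2.500  C:maj"):

>>> ref_intervals, ref_labels = mir_eval.io.load_labeled_intervals('ref_chords.lab')
>>> # ref_intervals[i] holds the start and end time of the i th annotation,
>>> # and ref_labels[i] holds its label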

mir_eval.io.load_time_series(filename, delimiter='\\s+', comment='#')

Import a time series from an annotation file. The file should consist of two columns of numeric values corresponding to the time and value of each sample of the time series.

Parameters
filenamestr

Path to the annotation file

delimiterstr

Separator regular expression. By default, lines will be split by any amount of whitespace.

commentstr or None

Comment regular expression. Any lines beginning with this string or pattern will be ignored.

Setting to None disables comments.

Returns
timesnp.ndarray

array of timestamps (float)

valuesnp.ndarray

array of corresponding numeric values (float)

mir_eval.io.load_patterns(filename)

Loads the patterns contained in filename and puts them into a list of patterns, each pattern being a list of occurrences, and each occurrence being a list of (onset, midi) pairs.

The input file must be formatted as described in MIREX 2013: http://www.music-ir.org/mirex/wiki/2013:Discovery_of_Repeated_Themes_%26_Sections

Parameters
filenamestr

The input file path containing the patterns of a given piece using the MIREX 2013 format.

Returns
pattern_listlist

The list of patterns, containing all their occurrences, using the following format:

onset_midi = (onset_time, midi_number)
occurrence = [onset_midi1, ..., onset_midiO]
pattern = [occurrence1, ..., occurrenceM]
pattern_list = [pattern1, ..., patternN]

where N is the number of patterns, M[i] is the number of occurrences of the i th pattern, and O[j] is the number of onsets in the j’th occurrence. E.g.:

occ1 = [(0.5, 67.0), (1.0, 67.0), (1.5, 67.0), (2.0, 64.0)]
occ2 = [(4.5, 65.0), (5.0, 65.0), (5.5, 65.0), (6.0, 62.0)]
pattern1 = [occ1, occ2]

occ1 = [(10.5, 67.0), (11.0, 67.0), (11.5, 67.0), (12.0, 64.0),
        (12.5, 69.0), (13.0, 69.0), (13.5, 69.0), (14.0, 67.0),
        (14.5, 76.0), (15.0, 76.0), (15.5, 76.0), (16.0, 72.0)]
occ2 = [(18.5, 67.0), (19.0, 67.0), (19.5, 67.0), (20.0, 62.0),
        (20.5, 69.0), (21.0, 69.0), (21.5, 69.0), (22.0, 67.0),
        (22.5, 77.0), (23.0, 77.0), (23.5, 77.0), (24.0, 74.0)]
pattern2 = [occ1, occ2]

pattern_list = [pattern1, pattern2]
mir_eval.io.load_wav(path, mono=True)

Loads a .wav file as a numpy array using scipy.io.wavfile.

Parameters
pathstr

Path to a .wav file

monobool

If the provided .wav has more than one channel, it will be converted to mono if mono=True. (Default value = True)

Returns
audio_datanp.ndarray

Array of audio samples, normalized to the range [-1., 1.]

fsint

Sampling rate of the audio data

mir_eval.io.load_valued_intervals(filename, delimiter='\\s+', comment='#')

Import valued intervals from an annotation file. The file should consist of three columns: Two consisting of numeric values corresponding to start and end time of each interval and a third, also of numeric values, corresponding to the value of each interval. This is primarily useful for processing events which span a duration and have a numeric value, such as piano-roll notes which have an onset, offset, and a pitch value.

Parameters
filenamestr

Path to the annotation file

delimiterstr

Separator regular expression. By default, lines will be split by any amount of whitespace.

commentstr or None

Comment regular expression. Any lines beginning with this string or pattern will be ignored.

Setting to None disables comments.

Returns
intervalsnp.ndarray, shape=(n_events, 2)

Array of event start and end times

valuesnp.ndarray, shape=(n_events,)

Array of values

mir_eval.io.load_key(filename, delimiter='\\s+', comment='#')

Load key labels from an annotation file. The file should consist of two string columns: One denoting the key scale degree (semitone), and the other denoting the mode (major or minor). The file should contain only one row.

Parameters
filenamestr

Path to the annotation file

delimiterstr

Separator regular expression. By default, lines will be split by any amount of whitespace.

commentstr or None

Comment regular expression. Any lines beginning with this string or pattern will be ignored.

Setting to None disables comments.

Returns
keystr

Key label, in the form '(key) (mode)'

mir_eval.io.load_tempo(filename, delimiter='\\s+', comment='#')

Load tempo estimates from an annotation file in MIREX format. The file should consist of three numeric columns: the first two correspond to tempo estimates (in beats-per-minute), and the third denotes the relative confidence of the first value compared to the second (in the range [0, 1]). The file should contain only one row.

Parameters
filenamestr

Path to the annotation file

delimiterstr

Separator regular expression. By default, lines will be split by any amount of whitespace.

commentstr or None

Comment regular expression. Any lines beginning with this string or pattern will be ignored.

Setting to None disables comments.

Returns
tempinp.ndarray, non-negative

The two tempo estimates

weightfloat [0, 1]

The relative importance of tempi[0] compared to tempi[1]

mir_eval.io.load_ragged_time_series(filename, dtype=<class 'float'>, delimiter='\\s+', header=False, comment='#')

Utility function for loading in data from a delimited time series annotation file with a variable number of columns. Assumes that column 0 contains time stamps and columns 1 through n contain values. n may be variable from time stamp to time stamp.

Parameters
filenamestr

Path to the annotation file

dtypefunction

Data type to apply to values columns.

delimiterstr

Separator regular expression. By default, lines will be split by any amount of whitespace.

headerbool

Indicates whether a header row is present or not. By default, assumes no header is present.

commentstr or None

Comment regular expression. Any lines beginning with this string or pattern will be ignored.

Setting to None disables comments.

Returns
timesnp.ndarray

array of timestamps (float)

valueslist of np.ndarray

list of arrays of corresponding values

Examples

>>> # Load a ragged list of tab-delimited multi-f0 midi notes
>>> times, vals = load_ragged_time_series('multif0.txt', dtype=int,
...                                       delimiter='\t')
>>> # Load a ragged list of space-delimited multi-f0 values with a header
>>> times, vals = load_ragged_time_series('labeled_events.csv',
...                                       header=True)

mir_eval.sonify

Methods which sonify annotations for “evaluation by ear”. All functions return a raw signal at the specified sampling rate.

mir_eval.sonify.clicks(times, fs, click=None, length=None)

Returns a signal with the click signal placed at each specified time

Parameters
timesnp.ndarray

times to place clicks, in seconds

fsint

desired sampling rate of the output signal

clicknp.ndarray

click signal, defaults to a 1 kHz blip

lengthint

desired number of samples in the output signal, defaults to times.max()*fs + click.shape[0] + 1

Returns
click_signalnp.ndarray

Synthesized click signal
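
Example (a minimal sketch; 'estimated_beats.txt' and 'beats_sonified.wav' are
hypothetical file names):

>>> import numpy as np
>>> import scipy.io.wavfile
>>> beats = mir_eval.io.load_events('estimated_beats.txt')
>>> audio = mir_eval.sonify.clicks(beats, fs=44100)
>>> scipy.io.wavfile.write('beats_sonified.wav', 44100, audio.astype(np.float32))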

mir_eval.sonify.time_frequency(gram, frequencies, times, fs, function=<ufunc 'sin'>, length=None, n_dec=1)

Reverse synthesis of a time-frequency representation of a signal

Parameters
gramnp.ndarray

gram[n, m] is the magnitude of frequencies[n] from times[m] to times[m + 1]

Non-positive magnitudes are interpreted as silence.

frequenciesnp.ndarray

array of size gram.shape[0] denoting the frequency of each row of gram

timesnp.ndarray, shape= (gram.shape[1],) or (gram.shape[1], 2)

Either the start time of each column in the gram, or the time interval corresponding to each column.

fsint

desired sampling rate of the output signal

functionfunction

function to use to synthesize notes, should be 2\pi-periodic

lengthint

desired number of samples in the output signal, defaults to times[-1]*fs

n_decint

the number of decimals used to approximate each sonified frequency. Defaults to 1 decimal place. Higher precision will be slower.

Returns
outputnp.ndarray

synthesized version of the piano roll

mir_eval.sonify.pitch_contour(times, frequencies, fs, amplitudes=None, function=<ufunc 'sin'>, length=None, kind='linear')

Sonify a pitch contour.

Parameters
timesnp.ndarray

time indices for each frequency measurement, in seconds

frequenciesnp.ndarray

frequency measurements, in Hz. Non-positive measurements will be interpreted as un-voiced samples.

fsint

desired sampling rate of the output signal

amplitudesnp.ndarray

amplitude measurements (non-negative). Defaults to np.ones((length,))

functionfunction

function to use to synthesize notes, should be 2\pi-periodic

lengthint

desired number of samples in the output signal, defaults to max(times)*fs

kindstr

Interpolation mode for the frequency and amplitude values. See: scipy.interpolate.interp1d for valid settings.

Returns
outputnp.ndarray

synthesized version of the pitch contour

mir_eval.sonify.chroma(chromagram, times, fs, **kwargs)

Reverse synthesis of a chromagram (semitone matrix)

Parameters
chromagramnp.ndarray, shape=(12, times.shape[0])

Chromagram matrix, where each row represents a semitone [C->Bb] i.e., chromagram[3, j] is the magnitude of D# from times[j] to times[j + 1]

times: np.ndarray, shape=(chromagram.shape[1],) or (chromagram.shape[1], 2)

Either the start time of each column in the chromagram, or the time interval corresponding to each column.

fsint

Sampling rate to synthesize audio data at

kwargs

Additional keyword arguments to pass to mir_eval.sonify.time_frequency()

Returns
outputnp.ndarray

Synthesized chromagram

mir_eval.sonify.chords(chord_labels, intervals, fs, **kwargs)

Synthesizes chord labels

Parameters
chord_labelslist of str

List of chord label strings.

intervalsnp.ndarray, shape=(len(chord_labels), 2)

Start and end times of each chord label

fsint

Sampling rate to synthesize at

kwargs

Additional keyword arguments to pass to mir_eval.sonify.time_frequency()

Returns
outputnp.ndarray

Synthesized chord labels
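
Example (a minimal sketch; 'est_chords.lab' is a hypothetical chord annotation
file):

>>> intervals, labels = mir_eval.io.load_labeled_intervals('est_chords.lab')
>>> audio = mir_eval.sonify.chords(labels, intervals, fs=22050)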

mir_eval.display

Display functions

mir_eval.display.segments(intervals, labels, base=None, height=None, text=False, text_kw=None, ax=None, **kwargs)

Plot a segmentation as a set of disjoint rectangles.

Parameters
intervalsnp.ndarray, shape=(n, 2)

segment intervals, in the format returned by mir_eval.io.load_intervals() or mir_eval.io.load_labeled_intervals().

labelslist, shape=(n,)

reference segment labels, in the format returned by mir_eval.io.load_labeled_intervals().

basenumber

The vertical position of the base of the rectangles. By default, this will be the bottom of the plot.

heightnumber

The height of the rectangles. By default, this will be the top of the plot (minus base).

textbool

If true, each segment’s label is displayed in its upper-left corner

text_kwdict

If text == True, the properties of the text object can be specified here. See matplotlib.pyplot.Text for valid parameters

axmatplotlib.pyplot.axes

An axis handle on which to draw the segmentation. If none is provided, a new set of axes is created.

kwargs

Additional keyword arguments to pass to matplotlib.patches.Rectangle.

Returns
axmatplotlib.pyplot.axes._subplots.AxesSubplot

A handle to the (possibly constructed) plot axes
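
Example (a minimal sketch; 'ref_segments.lab' is a hypothetical segmentation
annotation file):

>>> import matplotlib.pyplot as plt
>>> intervals, labels = mir_eval.io.load_labeled_intervals('ref_segments.lab')
>>> mir_eval.display.segments(intervals, labels, text=True)
>>> plt.show()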

mir_eval.display.labeled_intervals(intervals, labels, label_set=None, base=None, height=None, extend_labels=True, ax=None, tick=True, **kwargs)

Plot labeled intervals with each label on its own row.

Parameters
intervalsnp.ndarray, shape=(n, 2)

segment intervals, in the format returned by mir_eval.io.load_intervals() or mir_eval.io.load_labeled_intervals().

labelslist, shape=(n,)

reference segment labels, in the format returned by mir_eval.io.load_labeled_intervals().

label_setlist

An (ordered) list of labels to determine the plotting order. If not provided, the labels will be inferred from ax.get_yticklabels(). If no yticklabels exist, then the sorted set of unique values in labels is taken as the label set.

basenp.ndarray, shape=(n,), optional

Vertical positions of each label. By default, labels are positioned at integers np.arange(len(labels)).

heightscalar or np.ndarray, shape=(n,), optional

Height for each label. If scalar, the same value is applied to all labels. By default, each label has height=1.

extend_labelsbool

If False, only values of labels that also exist in label_set will be shown.

If True, all labels are shown, with those in labels but not in label_set appended to the top of the plot. A horizontal line is drawn to indicate the separation between values in or out of label_set.

axmatplotlib.pyplot.axes

An axis handle on which to draw the intervals. If none is provided, a new set of axes is created.

tickbool

If True, sets tick positions and labels on the y-axis.

kwargs

Additional keyword arguments to pass to matplotlib.collections.BrokenBarHCollection.

Returns
axmatplotlib.pyplot.axes._subplots.AxesSubplot

A handle to the (possibly constructed) plot axes

class mir_eval.display.IntervalFormatter(base, ticks)

Bases: matplotlib.ticker.Formatter

Ticker formatter for labeled interval plots.

Parameters
basearray-like of int

The base positions of each label

ticksarray-like of string

The labels for the ticks

Attributes
axis

Methods

__call__(x[, pos])

Return the format for tick value x at position pos.

fix_minus(s)

Some classes may want to replace a hyphen for minus with the proper Unicode symbol (U+2212) for typographical correctness. This is a helper method to perform such a replacement when it is enabled via the axes.unicode_minus rcParam.

format_data(value)

Return the full string representation of the value with the position unspecified.

format_data_short(value)

Return a short string version of the tick value.

format_ticks(values)

Return the tick labels for all the ticks at once.

set_locs(locs)

Set the locations of the ticks.

create_dummy_axis

get_offset

set_axis

set_bounds

set_data_interval

set_view_interval

mir_eval.display.hierarchy(intervals_hier, labels_hier, levels=None, ax=None, **kwargs)

Plot a hierarchical segmentation

Parameters
intervals_hierlist of np.ndarray

A list of segmentation intervals. Each element should be an n-by-2 array of segment intervals, in the format returned by mir_eval.io.load_intervals() or mir_eval.io.load_labeled_intervals(). Segmentations should be ordered by increasing specificity.

labels_hierlist of list-like

A list of segmentation labels. Each element should be a list of labels for the corresponding element in intervals_hier.

levelslist of string

Each element levels[i] is a label for the i th segmentation. This is used in the legend to denote the levels in a segment hierarchy.

kwargs

Additional keyword arguments to labeled_intervals.

Returns
axmatplotlib.pyplot.axes._subplots.AxesSubplot

A handle to the (possibly constructed) plot axes

mir_eval.display.events(times, labels=None, base=None, height=None, ax=None, text_kw=None, **kwargs)

Plot event times as a set of vertical lines

Parameters
timesnp.ndarray, shape=(n,)

event times, in the format returned by mir_eval.io.load_events() or mir_eval.io.load_labeled_events().

labelslist, shape=(n,), optional

event labels, in the format returned by mir_eval.io.load_labeled_events().

basenumber

The vertical position of the base of the line. By default, this will be the bottom of the plot.

heightnumber

The height of the lines. By default, this will be the top of the plot (minus base).

axmatplotlib.pyplot.axes

An axis handle on which to draw the segmentation. If none is provided, a new set of axes is created.

text_kwdict

If labels is provided, the properties of the text objects can be specified here. See matplotlib.pyplot.Text for valid parameters

kwargs

Additional keyword arguments to pass to matplotlib.pyplot.vlines.

Returns
axmatplotlib.pyplot.axes._subplots.AxesSubplot

A handle to the (possibly constructed) plot axes

mir_eval.display.pitch(times, frequencies, midi=False, unvoiced=False, ax=None, **kwargs)

Visualize pitch contours

Parameters
timesnp.ndarray, shape=(n,)

Sample times of frequencies

frequenciesnp.ndarray, shape=(n,)

frequencies (in Hz) of the pitch contours. Voicing is indicated by sign (positive for voiced, non-positive for non-voiced).

midibool

If True, plot on a MIDI-numbered vertical axis. Otherwise, plot on a linear frequency axis.

unvoicedbool

If True, unvoiced pitch contours are plotted and indicated by transparency.

Otherwise, unvoiced pitch contours are omitted from the display.

axmatplotlib.pyplot.axes

An axis handle on which to draw the pitch contours. If none is provided, a new set of axes is created.

kwargs

Additional keyword arguments to matplotlib.pyplot.plot.

Returns
axmatplotlib.pyplot.axes._subplots.AxesSubplot

A handle to the (possibly constructed) plot axes

mir_eval.display.multipitch(times, frequencies, midi=False, unvoiced=False, ax=None, **kwargs)

Visualize multiple f0 measurements

Parameters
timesnp.ndarray, shape=(n,)

Sample times of frequencies

frequencieslist of np.ndarray

frequencies (in Hz) of the pitch measurements. Voicing is indicated by sign (positive for voiced, non-positive for non-voiced).

times and frequencies should be in the format produced by mir_eval.io.load_ragged_time_series()

midibool

If True, plot on a MIDI-numbered vertical axis. Otherwise, plot on a linear frequency axis.

unvoicedbool

If True, unvoiced pitches are plotted and indicated by transparency.

Otherwise, unvoiced pitches are omitted from the display.

axmatplotlib.pyplot.axes

An axis handle on which to draw the pitch contours. If none is provided, a new set of axes is created.

kwargs

Additional keyword arguments to plt.scatter.

Returns
axmatplotlib.pyplot.axes._subplots.AxesSubplot

A handle to the (possibly constructed) plot axes

mir_eval.display.piano_roll(intervals, pitches=None, midi=None, ax=None, **kwargs)

Plot a quantized piano roll as intervals

Parameters
intervalsnp.ndarray, shape=(n, 2)

timing intervals for notes

pitchesnp.ndarray, shape=(n,), optional

pitches of notes (in Hz).

midinp.ndarray, shape=(n,), optional

pitches of notes (in MIDI numbers).

At least one of pitches or midi must be provided.

axmatplotlib.pyplot.axes

An axis handle on which to draw the intervals. If none is provided, a new set of axes is created.

kwargs

Additional keyword arguments to labeled_intervals().

Returns
axmatplotlib.pyplot.axes._subplots.AxesSubplot

A handle to the (possibly constructed) plot axes
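
Example (a minimal sketch; 'ref_notes.txt' is a hypothetical note annotation
with onset, offset, and pitch (Hz) columns):

>>> import matplotlib.pyplot as plt
>>> note_intervals, note_pitches = mir_eval.io.load_valued_intervals('ref_notes.txt')
>>> mir_eval.display.piano_roll(note_intervals, pitches=note_pitches)
>>> plt.show()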

mir_eval.display.separation(sources, fs=22050, labels=None, alpha=0.75, ax=None, **kwargs)

Source-separation visualization

Parameters
sourcesnp.ndarray, shape=(nsrc, nsampl)

A list of waveform buffers corresponding to each source

fsnumber > 0

The sampling rate

labelslist of strings

An optional list of descriptors corresponding to each source

alphafloat in [0, 1]

Maximum alpha (opacity) of spectrogram values.

axmatplotlib.pyplot.axes

An axis handle on which to draw the spectrograms. If none is provided, a new set of axes is created.

kwargs

Additional keyword arguments to scipy.signal.spectrogram

Returns
ax

The axis handle for this plot

mir_eval.display.ticker_notes(ax=None)

Set the y-axis of the given axes to MIDI notes

Parameters
axmatplotlib.pyplot.axes

The axes handle to apply the ticker. By default, uses the current axes handle.

mir_eval.display.ticker_pitch(ax=None)

Set the y-axis of the given axes to MIDI frequencies

Parameters
axmatplotlib.pyplot.axes

The axes handle to apply the ticker. By default, uses the current axes handle.
