A machine learning-based Pokémon cry recognition system built with Python.
This project recognizes Pokémon from audio input using MFCC feature extraction and a RandomForest classifier, with a desktop GUI that supports:
- Audio file selection
- Microphone recording
- Top-3 prediction
- Confidence thresholding
- Pokémon image display
- Multilingual Pokémon names
The project was originally inspired by the idea of identifying Pokémon encounter sounds in arcade environments.
The system takes a Pokémon cry audio file or microphone recording as input, extracts audio features, predicts the most likely Pokémon, and displays the result through a GUI.
Main workflow:
Audio Input
↓
Preprocessing
↓
MFCC Feature Extraction
↓
RandomForest Classifier
↓
Top-3 Prediction
↓
GUI Result Display
- Recognition support for more than 1,000 Pokémon
- Audio upload support
- Microphone recording support
- MFCC-based audio feature extraction
- RandomForest-based classification
- Top-3 prediction display
- Confidence threshold / Unknown detection
- Pokémon image display
- Automatic Pokémon image downloading
- English / Japanese / Chinese Pokémon names
- Audio augmentation pipeline
- Dataset generation pipeline
- Model evaluation tools
- Spectrogram generation pipeline
- Experimental CNN training pipeline
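The Top-3 prediction and confidence threshold / Unknown detection features listed above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `top_k_with_threshold`, the `"Unknown"` label string, and the default threshold of 0.5 are all assumptions. The input is a plain mapping from class name to probability, such as one built from the per-class probabilities that a scikit-learn classifier returns from `predict_proba`.

```python
from typing import Dict, List, Tuple

UNKNOWN_LABEL = "Unknown"  # hypothetical label used when the prediction is rejected


def top_k_with_threshold(
    probabilities: Dict[str, float],
    k: int = 3,
    threshold: float = 0.5,  # assumed value; the project's real threshold may differ
) -> Tuple[str, List[Tuple[str, float]]]:
    """Return (final_label, top_k_candidates) from a class -> probability mapping."""
    # Rank classes by probability, highest first, and keep the best k.
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    top_k = ranked[:k]
    if not top_k:
        return UNKNOWN_LABEL, []
    best_label, best_prob = top_k[0]
    # Reject the prediction when the best candidate is not confident enough.
    final_label = best_label if best_prob >= threshold else UNKNOWN_LABEL
    return final_label, top_k


if __name__ == "__main__":
    probs = {"Pikachu": 0.62, "Raichu": 0.21, "Pichu": 0.09, "Eevee": 0.08}
    label, candidates = top_k_with_threshold(probs, k=3, threshold=0.5)
    print(f"Prediction: {label}")
    for name, prob in candidates:
        print(f"  {name}: {prob:.2%}")
```

Returning the candidate list even when the final label is Unknown lets a GUI still show the closest matches next to the rejection.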
Evaluation on the clean and augmented dataset:
| Metric | Accuracy |
|---|---|
| Top-1 Accuracy | 99.06% |
| Top-3 Accuracy | 99.70% |
| Top-5 Accuracy | 99.82% |
Note:
These results are evaluated on the generated clean/augmented dataset and do not fully represent real-world arcade or microphone recording performance.
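For readers unfamiliar with the metrics in the table, the sketch below shows how Top-k accuracy is generally computed: a sample counts as correct when its true class is among the k classes with the highest predicted probability. The function name and the toy data are illustrative assumptions, not taken from `evaluate_model.py`. scikit-learn also provides `sklearn.metrics.top_k_accuracy_score` for the same purpose.

```python
from typing import Sequence


def top_k_accuracy(
    probabilities: Sequence[Sequence[float]],
    true_labels: Sequence[int],
    k: int,
) -> float:
    """Fraction of samples whose true class index is among the k top-scored classes."""
    if len(probabilities) != len(true_labels):
        raise ValueError("probabilities and true_labels must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, truth in zip(probabilities, true_labels):
        # Indices of the k classes with the highest predicted probability.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += truth in top_k
    return hits / len(true_labels)


if __name__ == "__main__":
    # Toy data: 4 samples, 3 classes (made up for illustration).
    probs = [
        [0.70, 0.20, 0.10],
        [0.35, 0.40, 0.25],
        [0.10, 0.30, 0.60],
        [0.50, 0.30, 0.20],
    ]
    truth = [0, 0, 1, 2]
    for k in (1, 2, 3):
        print(f"Top-{k} accuracy: {top_k_accuracy(probs, truth, k):.2f}")
```

On the toy data this prints 0.25, 0.75, and 1.00 for k = 1, 2, and 3. Top-k accuracy can only stay the same or rise as k grows, which is the pattern the table shows.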
The current main model uses:
MFCC + RandomForestClassifier
This approach performed well under limited per-class data conditions.
An experimental CNN model using Mel Spectrogram images is also included.
Current observations suggest that MFCC + RandomForest significantly outperforms the baseline CNN when only limited data is available per class.
pokemon-cry-recognition/
│
├── src/
│ ├── app.py
│ ├── build_dataset.py
│ ├── train_model.py
│ ├── predict.py
│ ├── evaluate_model.py
│ ├── generate_spectrograms.py
│ ├── train_cnn.py
│ └── download_images.py
│
├── models/
│ └── pokemon_names.json
│
├── cries/
│
├── requirements.txt
├── .gitignore
└── README.md
git clone https://github.com/YOUR_USERNAME/pokemon-cry-recognition.git
cd pokemon-cry-recognition
py -m venv venv
Activate on Windows PowerShell:
.\venv\Scripts\Activate.ps1
If PowerShell blocks activation, run:
Set-ExecutionPolicy -Scope CurrentUser RemoteSigned
Then activate again.
py -m pip install -r requirements.txt
This project requires FFmpeg for audio processing.
Download FFmpeg from:
https://ffmpeg.org/download.html
Or install on Windows using:
winget install ffmpeg
Verify installation:
ffmpeg -version
The dataset generation script:
- Reads Pokémon cry audio files
- Retrieves multilingual Pokémon names
- Creates ID-based folders
- Applies audio augmentation
- Extracts MFCC features
- Saves training data
Run:
py src/build_dataset.py
Generated files:
models/features.npy
models/labels.npy
models/pokemon_names.json
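The augmentation step of the dataset generation script is not described in detail here. As a rough illustration of what waveform augmentation can look like, the sketch below applies a random gain, a random time shift, and light Gaussian noise to a mono waveform. The function name, the parameter defaults, and the choice of these three transforms are assumptions and may not match what `build_dataset.py` does.

```python
import random
from typing import List, Sequence


def augment_waveform(
    samples: Sequence[float],
    rng: random.Random,
    max_gain_db: float = 6.0,  # assumed gain range in decibels
    max_shift: int = 100,      # assumed maximum shift in samples
    noise_std: float = 0.005,  # assumed noise standard deviation
) -> List[float]:
    """Return a randomly perturbed copy of a mono waveform with values in [-1.0, 1.0]."""
    samples = list(samples)
    # 1. Random gain, drawn in decibels and converted to a linear factor.
    gain = 10.0 ** (rng.uniform(-max_gain_db, max_gain_db) / 20.0)
    # 2. Random time shift, padded with silence so the length stays the same.
    limit = min(max_shift, len(samples))
    shift = rng.randint(-limit, limit)
    if shift > 0:
        shifted = [0.0] * shift + samples[:-shift]
    elif shift < 0:
        shifted = samples[-shift:] + [0.0] * (-shift)
    else:
        shifted = samples
    # 3. Apply the gain, add Gaussian noise, and clip to the valid range.
    return [
        max(-1.0, min(1.0, gain * s + rng.gauss(0.0, noise_std)))
        for s in shifted
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the augmented copies are reproducible
    clean = [0.5 * ((i % 20) / 10 - 1) for i in range(400)]  # toy sawtooth wave
    variants = [augment_waveform(clean, rng) for _ in range(3)]
    print([len(v) for v in variants])
```

Generating several variants per cry is one common way to compensate for having only one or a few original recordings per class.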
py src/train_model.py
This trains the RandomForest classifier and generates:
models/pokemon_cry_model.pkl
models/label_encoder.pkl
Note:
The trained model files are intentionally not included in this repository because of GitHub file size limitations.
Users should generate the model locally.
py src/app.py
The GUI supports:
Select Audio File
Record Audio
Play Audio
Predict
Top-3 Result Display
Pokémon Image Display
Example:
py src/predict.py data_raw/025_Pikachu/25.ogg
Example output:
Predicted Pokémon: Pikachu
Confidence: 86.67%
py src/evaluate_model.py
Generated evaluation outputs:
evaluation_results/
├── classification_report.csv
├── confusion_matrix.csv
├── sample_predictions.csv
├── weak_classes_by_recall.csv
└── low_confidence_predictions.csv
The evaluation script also reports:
Top-1 Accuracy
Top-3 Accuracy
Top-5 Accuracy
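The name `weak_classes_by_recall.csv` suggests a ranking of classes by per-class recall. A minimal sketch of how such a ranking can be computed is shown below. The function name and the toy labels are assumptions, and the actual evaluation script may compute or format this differently.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def classes_by_recall(
    true_labels: Sequence[str],
    predicted_labels: Sequence[str],
) -> List[Tuple[str, float]]:
    """Return (class, recall) pairs sorted from weakest to strongest recall."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    # Number of samples per true class.
    totals = Counter(true_labels)
    # Number of correctly predicted samples per true class.
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    recalls = [(label, correct[label] / totals[label]) for label in totals]
    # Sort by recall, breaking ties alphabetically for a stable report.
    return sorted(recalls, key=lambda pair: (pair[1], pair[0]))


if __name__ == "__main__":
    y_true = ["Pikachu", "Pikachu", "Eevee", "Eevee", "Eevee", "Mew"]
    y_pred = ["Pikachu", "Raichu", "Eevee", "Eevee", "Eevee", "Mewtwo"]
    for name, recall in classes_by_recall(y_true, y_pred):
        print(f"{name}: {recall:.2f}")
```

Classes at the top of this ranking are the ones most often missed, so they are natural candidates for extra augmentation or manual inspection.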
To generate Mel Spectrogram images:
py src/generate_spectrograms.py
Generated output:
spectrograms/
These images are used for the experimental CNN model.
Train the experimental CNN model:
py src/train_cnn.py
Current observation:
MFCC + RandomForest performs significantly better
than the baseline CNN under limited per-class data conditions.
Future improvements may include:
- More augmented samples
- Better CNN architectures
- Transfer learning
- Real-world recording datasets
- Noise simulation
The long-term goal of this project is to build a Pokémon arcade encounter sound assistant.
Target workflow:
Arcade encounter sound
↓
Microphone recording
↓
Prediction
↓
Top-3 Pokémon candidates
Because real arcade environments contain:
- Background noise
- Speaker distortion
- Reverberation
- Human voices
future development will focus on noise robustness and real-world adaptation.
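One standard way to approach noise robustness is to mix recorded background noise into clean cries at a controlled signal-to-noise ratio (SNR). The sketch below scales a noise clip so that the mix reaches a requested SNR in decibels. It is an illustration of the general technique, not part of the current project, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a waveform."""
    if not samples:
        raise ValueError("samples must not be empty")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(
    signal: Sequence[float],
    noise: Sequence[float],
    snr_db: float,
) -> List[float]:
    """Add noise to signal, scaled so that the signal-to-noise ratio equals snr_db."""
    if len(signal) != len(noise):
        raise ValueError("signal and noise must have the same length")
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        return list(signal)  # silent noise clip: nothing to mix in
    # SNR(dB) = 20 * log10(rms_signal / rms_noise), solved for the noise level.
    target_noise_rms = rms(signal) / (10.0 ** (snr_db / 20.0))
    scale = target_noise_rms / noise_rms
    return [s + scale * n for s, n in zip(signal, noise)]


if __name__ == "__main__":
    rng = random.Random(42)
    cry = [math.sin(2 * math.pi * 440 * i / 8000) for i in range(800)]  # toy tone
    background = [rng.gauss(0.0, 1.0) for _ in range(800)]              # toy noise
    noisy = mix_at_snr(cry, background, snr_db=10.0)
    print(len(noisy))
```

Sweeping `snr_db` from high to low values gives a simple way to measure how quickly accuracy degrades as conditions get closer to a noisy arcade.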
- Real arcade recording dataset
- Better microphone preprocessing
- Silence trimming
- Volume normalization
- Background noise augmentation
- Arcade speaker simulation
- Improved Unknown detection
- Better CNN performance
- Mobile application version
- Real-time continuous listening mode
- Executable packaging
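Two of the roadmap items above, silence trimming and volume normalization, are simple enough to sketch directly. The version below uses a fixed amplitude threshold and peak normalization. Both function names and both default values are assumptions made for illustration. A library routine such as `librosa.effects.trim` could be used instead for the trimming part.

```python
from typing import List, Sequence


def trim_silence(samples: Sequence[float], threshold: float = 0.01) -> List[float]:
    """Drop leading and trailing samples whose absolute value is below threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []  # the whole clip is below the threshold
    return list(samples[loud[0] : loud[-1] + 1])


def peak_normalize(samples: Sequence[float], target_peak: float = 0.9) -> List[float]:
    """Scale the waveform so that its largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # empty or silent clip: leave unchanged
    factor = target_peak / peak
    return [s * factor for s in samples]


if __name__ == "__main__":
    recording = [0.0, 0.001, 0.2, -0.4, 0.1, 0.0, 0.0]  # toy microphone clip
    cleaned = peak_normalize(trim_silence(recording))
    print(cleaned)
```

Trimming before normalizing keeps a stray click in the silent padding from determining the scale factor.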
- Python
- Librosa
- NumPy
- Scikit-learn
- RandomForestClassifier
- CustomTkinter
- Pillow
- SoundDevice
- SoundFile
- PyTorch
- Matplotlib
Pokémon is a trademark of Nintendo, Game Freak, Creatures, and The Pokémon Company.
This project is a non-commercial educational and experimental project.
It is not affiliated with or endorsed by Nintendo, Game Freak, Creatures, or The Pokémon Company.
