Philippine Languages Translation and AI Training Community
This organization is dedicated to the development of high-performance natural language processing (NLP) architectures for the major and regional languages of the Philippines. Our objective is to bridge the digital divide for low-resource languages through state-of-the-art model alignment, knowledge distillation, and the deployment of efficient, edge-ready AI models.
Technical Roadmap
Phase 1: Foundation Model Alignment and NMT Parity
Objective: Finetune large-scale transformer architectures (Llama 3.1/3.2 series) to achieve Neural Machine Translation (NMT) parity with commercial benchmarks for the eight major Philippine languages.
- Technical Detail: Implementation of Supervised Fine-Tuning (SFT) using high-quality parallel corpora and instruction-tuning datasets. This phase utilizes QLoRA and full-parameter tuning to optimize for Tagalog, Cebuano, Ilocano, Hiligaynon, Bicolano, Waray, Kapampangan, and Pangasinan.
- Milestone: Validated "Teacher" models capable of high-fidelity translation and complex instruction following, serving as the performance baseline for subsequent distillation.
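As a rough illustration of why QLoRA is used alongside full-parameter tuning, the sketch below counts the trainable parameters a low-rank adapter adds to a single weight matrix. The 4096 x 4096 projection size and the rank of 16 are illustrative assumptions, not the project's actual configuration.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole weight matrix is tuned."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """A LoRA adapter learns the update as B (d_out x rank) @ A (rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes only (assumed, not taken from the roadmap).
D_MODEL = 4096
RANK = 16

full = full_params(D_MODEL, D_MODEL)
lora = lora_trainable_params(D_MODEL, D_MODEL, RANK)
ratio = lora / full

print(f"full: {full:,}  lora: {lora:,}  ratio: {ratio:.4%}")
```

Under these assumed sizes the adapter trains well under 1% of the parameters in that matrix, which is the main reason adapter-based tuning fits on smaller GPUs than full-parameter tuning.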
Phase 2: Knowledge Distillation and Synthetic Corpus Generation
Objective: Utilize Phase 1 models as high-capacity Teacher models to generate high-density synthetic training data for low-resource linguistic variants.
- Technical Detail: Leveraging the Teacher models to perform Knowledge Distillation (KD) by generating synthetic instruction-response pairs and reasoning chains. This mitigates the scarcity of organic digital text in regional dialects and provides the required data density for training smaller student architectures without performance degradation.
- Milestone: A comprehensive multi-language synthetic dataset optimized for training sub-3B parameter models.
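The generation step described above could look roughly like the following sketch. The `teacher` callable, the `min_chars` threshold, and the `instruction`/`output` field names are all assumptions made for illustration; a real pipeline would call the Phase 1 model and apply much stricter quality filters.

```python
from typing import Callable, Iterable


def generate_synthetic_pairs(
    teacher: Callable[[str], str],
    prompts: Iterable[str],
    min_chars: int = 5,
) -> list[dict]:
    """Ask the teacher for a response to each prompt and keep usable pairs."""
    seen = set()
    pairs = []
    for prompt in prompts:
        response = teacher(prompt).strip()
        # Drop empty or very short responses.
        if len(response) < min_chars:
            continue
        # Drop exact duplicates of a pair already collected.
        key = (prompt, response)
        if key in seen:
            continue
        seen.add(key)
        pairs.append({"instruction": prompt, "output": response})
    return pairs


def toy_teacher(prompt: str) -> str:
    """Stand-in for a real Teacher model (hypothetical, canned answers only)."""
    canned = {"Isalin sa Cebuano: Magandang umaga": "Maayong buntag"}
    return canned.get(prompt, "")


prompts = [
    "Isalin sa Cebuano: Magandang umaga",
    "Isalin sa Cebuano: Magandang umaga",  # duplicate, filtered out
    "An unknown prompt",                   # empty response, filtered out
]
pairs = generate_synthetic_pairs(toy_teacher, prompts)
print(pairs)
```

Keeping the teacher behind a plain callable makes it easy to swap the toy function for a real inference client without touching the filtering logic.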
Phase 3: LFM 2.5 Implementation and Language Specialization
Objective: Train and specialize Liquid Foundation Model (LFM) 2.5 architectures to create lightweight, language-specific models.
- Technical Detail: Transitioning from standard Transformers to LFM 2.5 allows for linear scaling and reduced memory footprints. We use the distilled datasets from Phase 2 to train "Student" models that replicate the output distribution of the larger Llama models. Final optimization includes Direct Preference Optimization (DPO) to refine cultural and grammatical nuance for each specific language.
- Milestone: A suite of specialized, deployment-ready models (1.2B to 3B parameters) optimized for edge computing and local hardware integration.
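The two training signals named in Phase 3 can be written down compactly. The sketch below shows a temperature-softened KL term, one common way to make a student match a teacher's output distribution, and the standard per-example DPO loss. The temperature of 2.0 and beta of 0.1 are assumed defaults, and the project's actual objectives may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin))."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) equals log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))


print(distillation_kl([2.0, 1.0, 0.1], [1.5, 1.2, 0.3]))
print(dpo_loss(-10.0, -14.0, -11.0, -12.0))
```

In practice the KL term is often scaled by the squared temperature and mixed with a cross-entropy loss on the hard labels; that weighting is left out here for brevity.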
---
Stakeholder Engagement and Collaboration
The community is actively seeking institutional and technical stakeholders to assist in the scaling, adoption, and operationalization of these models.
Call for Partners
- Compute Provisioning: We are seeking partners to provide GPU resources (A100/H100 clusters) required for the heavy compute cycles in Phase 1 and Phase 2.
- Domain-Specific Finetuning: We invite organizations to adopt and finetune our existing foundation models for specialized sectors, including legal, medical, and governmental services.
- Validation and Evaluation: We are looking for academic and linguistic experts to conduct rigorous human evaluation and Red Teaming to ensure model safety and linguistic accuracy across regional variants.
- Deployment Integration: We seek partners interested in integrating these lightweight models into mobile applications or environments with limited connectivity.
Interested parties may reach out via the Hugging Face discussion board or review our current repository of model weights and datasets.
Progress Report for Phase 1
Summary: Phase 1 is underway, but achieving a high-fidelity "Teacher" model for Philippine languages using Llama 3.1 and machine-translated Alpaca data is currently bottlenecked. Llama 3.1's inherent English-centric bias combined with syntactically flawed, machine-translated training data creates a compounding error loop. This results in grammatical corruption, dialect mixing, and severe hallucinations rather than true Neural Machine Translation (NMT) parity. There is still a long way to go to build a reliable teacher model; we must pivot away from machine-translated shortcuts and invest in human-curated, native-first datasets before progressing to knowledge distillation.
Organization: Philippine Languages Translation and AI Training Community
Project Phase: Phase 1 - Foundation Model Alignment and NMT Parity
Target Languages: Tagalog, Cebuano, Ilocano, Hiligaynon, Bicolano, Waray, Kapampangan, and Pangasinan
Date: [Current Date]
1. Executive Summary
In pursuit of bridging the digital divide for low-resource Philippine languages, our team has initiated Phase 1 of our technical roadmap. The primary objective is to utilize Supervised Fine-Tuning (SFT) and QLoRA on large-scale transformer architectures (Llama 3.1 series) to achieve Neural Machine Translation (NMT) parity with commercial benchmarks.
While the infrastructure for full-parameter tuning and SFT is fully operational, initial evaluations reveal critical bottlenecks. Specifically, the inherent limitations of the Llama 3.1 base model, combined with the use of machine-translated instruction datasets (such as the Alpaca dataset), have severely hindered our progress toward building a reliable "Teacher" model.
2. Current Progress
- Infrastructure: QLoRA and SFT pipelines have been successfully implemented.
- Model Selection: Llama 3.1 has been established as the primary foundational architecture.
- Data Ingestion: Initial fine-tuning has commenced using machine-translated versions of the Alpaca instruction-tuning dataset across the eight major regional languages.
3. Key Challenges & Technical Limitations
Despite successful pipeline implementation, achieving high-fidelity translation and complex instruction following is currently compromised by two compounding factors (A and B), which together produce the failure mode described in C:
A. Inherent Limitations of the Llama 3.1 Architecture in Low-Resource Contexts
While Llama 3.1 is a highly capable state-of-the-art model, its pre-training corpus is overwhelmingly English-centric.
- Tokenization Inefficiency: The Llama 3.1 tokenizer is not optimized for the agglutinative and morphologically rich nature of Philippine languages (which rely heavily on complex prefixes, infixes, and suffixes). As a result, native words are fractured into an excessive number of tokens. This degrades the model's context window efficiency and severely impairs its syntactic reasoning.
- Latent Space Bias: Because the model's foundational weights lack deep representations of Philippine linguistics, it defaults to English-language logic and syntax, applying it unnaturally to regional dialects.
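Tokenization inefficiency is usually quantified as fertility, the average number of tokens produced per word. The sketch below measures it with a toy greedy tokenizer and a deliberately English-heavy vocabulary; both are contrived stand-ins, and a real measurement would use the actual Llama 3.1 tokenizer on a representative corpus.

```python
def greedy_tokenize(word, vocab):
    """Longest-match-first segmentation; unknown spans fall back to characters."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here, so emit a single character.
            tokens.append(word[i])
            i += 1
    return tokens


def fertility(text, vocab):
    """Average number of tokens per whitespace-separated word."""
    words = text.lower().split()
    if not words:
        raise ValueError("text contains no words")
    n_tokens = sum(len(greedy_tokenize(w, vocab)) for w in words)
    return n_tokens / len(words)


# Toy vocabulary (assumed): whole English words, only fragments of Tagalog.
TOY_VOCAB = {"the", "children", "are", "playing",
             "nag", "la", "ro", "ang", "mga", "bata"}

english = fertility("The children are playing", TOY_VOCAB)
tagalog = fertility("Naglalaro ang mga bata", TOY_VOCAB)
print(english, tagalog)  # 1.0 1.75
```

Here the single Tagalog verb "naglalaro" splits into four pieces while every English word stays whole, which is the effect described above in miniature.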
B. The Flaws of Machine-Translated Datasets (Alpaca)
To rapidly generate instruction-tuning data, the English Alpaca dataset was machine-translated into the target Philippine languages. From a multilingual perspective, this approach is highly problematic:
- Loss of Nuance and Context: Machine translation of the Alpaca dataset frequently results in literal, word-for-word translations. It fails to capture the cultural context, idioms, or appropriate levels of formality inherent to Philippine languages.
- Syntactic Corruption: Standard MT tools struggle with the Verb-Subject-Object (VSO) sentence structure common in Philippine languages, often forcing them into the English Subject-Verb-Object (SVO) structure. This trains the model on grammatically incorrect, "robotic" phrasing.
C. The Hallucination Loop (Multilingual Parity Failure)
When a fundamentally English-biased model (Llama 3.1) is fine-tuned using flawed, machine-translated training data, the result is a compounding error loop.
- Because the model lacks a strong native understanding of languages like Ilocano or Hiligaynon, it accepts the syntactically flawed machine-translated Alpaca data as absolute "ground truth."
- When prompted, the model attempts to map this flawed training data against its English-centric weights. This directly leads to severe hallucinations.
- Instead of achieving NMT parity, the model frequently generates outputs that mix dialects (e.g., confusing Tagalog and Cebuano vocabularies), fabricates non-existent words (morphological hallucinations), or confidently outputs nonsensical, direct translations of English idioms that hold no meaning to native speakers.
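Dialect mixing of the kind described above can be flagged with a crude lexicon check before human review. The sketch below is a minimal heuristic; the two marker lists are tiny, hand-picked examples assumed for illustration, and any real lexicon would need native-speaker verification and handling of words shared across languages.

```python
# Illustrative marker words only (assumed, not a vetted lexicon).
LEXICONS = {
    "tagalog": {"hindi", "bahay", "ano", "ngayon", "maganda"},
    "cebuano": {"dili", "balay", "unsa", "karon", "nindot"},
}


def marker_counts(text, lexicons):
    """Count how many marker words from each language appear in the text."""
    words = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    return {
        lang: sum(1 for w in words if w in lexicon)
        for lang, lexicon in lexicons.items()
    }


def looks_mixed(text, expected_lang, lexicons):
    """True if the text contains marker words from any other language."""
    counts = marker_counts(text, lexicons)
    return any(n > 0 for lang, n in counts.items() if lang != expected_lang)


sample = "Hindi maganda ang balay ngayon."
print(marker_counts(sample, LEXICONS))
print(looks_mixed(sample, "tagalog", LEXICONS))
```

A check like this only surfaces candidates for review. It cannot detect fabricated words or unnatural syntax, which still require fluent evaluators.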
4. Implications for the "Teacher Model"
The ultimate milestone for Phase 1 is the creation of a validated "Teacher" model capable of high-fidelity translation, which will serve as the performance baseline for knowledge distillation into smaller, efficient, edge-ready models.
Based on current data, this milestone is still far off. A teacher model trained on machine-translated data propagates and amplifies grammatical errors and hallucinations. If we proceed to distillation using the current iteration of the model, we will inevitably pass these errors down to the smaller student models, rendering them unusable for native speakers and defeating the purpose of edge deployment.
5. Strategic Recommendations & Next Steps
To achieve true NMT parity and develop a high-fidelity Teacher model, we must pivot our data strategy:
- Move Beyond Machine Translation: We must deprecate reliance on purely machine-translated datasets like Alpaca.
- Human-in-the-Loop (HITL) Curation: Invest in community-driven, native-speaker verification of parallel corpora to ensure morphological and syntactic accuracy.
- Native Prompt Generation: Shift toward instruction-tuning datasets that are originally authored in the target languages, rather than translating English instructions.
- Vocabulary Expansion: Investigate vocabulary expansion and embedding initialization techniques for Llama 3.1 to improve tokenization efficiency for Philippine languages prior to SFT.
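One commonly used embedding initialization for vocabulary expansion is to start each new token at the mean of the embeddings of the subword pieces it replaces. The sketch below shows that idea on a toy table of two-dimensional vectors; the table, the token IDs, and the helper names are hypothetical, and this is not necessarily the technique the project will adopt.

```python
def mean_init_embedding(subtoken_ids, embedding_table):
    """Average the embeddings of the pieces the new token used to split into."""
    vectors = [embedding_table[i] for i in subtoken_ids]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def add_token(embedding_table, subtoken_ids):
    """Append a new token row and return its ID."""
    new_id = max(embedding_table) + 1
    embedding_table[new_id] = mean_init_embedding(subtoken_ids, embedding_table)
    return new_id


# Toy embedding table (assumed): token ID -> vector.
table = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}

# Pretend a new whole-word token previously split into pieces 0 and 1.
new_id = add_token(table, [0, 1])
print(new_id, table[new_id])  # 3 [0.5, 0.5]
```

Starting from the mean keeps the new token close to where the model already represented the word, so early fine-tuning steps are less disruptive than with random initialization.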
Conclusion:
Building high-performance NLP architectures for Philippine languages cannot rely on shortcuts like machine-translated datasets layered over English-centric models. Achieving the high-fidelity teacher model required for our roadmap demands a rigorous, culturally accurate, and native-first approach to data curation.
Current Status: Crowdsourced Authentic Dataset Generation Strategy
Summary: In response to the hallucination loop caused by machine-translated training data, stakeholders have pivoted towards authentic, native-first dataset curation. To facilitate this, we have developed the PLTAT App, an all-in-one "Swiss Army knife" platform for crowdsourcing the translation, generation, evaluation, and correction of NLP datasets. Because building a high-fidelity teacher model is a long-term, iterative process, we are actively seeking institutional stakeholders (universities, government agencies) to sustain this effort. Technical resources, including the PLTAT Chat App and our Ollama Colab Server Notebook, are now live for community testing.
Organization: Philippine Languages Translation and AI Training Community (PLTAT)
Project Phase: Phase 1.5 - Authentic Data Remediation & HITL Integration
Date: April 6, 2026
1. Strategic Pivot: The Need for Authentic Datasets
Following the findings of the Phase 1 Progress Report, project stakeholders convened to address the severe limitations of fine-tuning the Llama 3.1 architecture using machine-translated data (e.g., the Alpaca dataset). The group unanimously concluded that relying on automated translation pipelines results in syntactic corruption and severe hallucinations, preventing us from achieving Neural Machine Translation (NMT) parity.
The Resolution: We are shifting our methodology from automated data ingestion to authentic, human-verified dataset curation. To train a high-fidelity "Teacher" model capable of accurate knowledge distillation, the model must be trained on high-quality, native text authored and verified by fluent speakers of Tagalog, Cebuano, Ilocano, Hiligaynon, Bicolano, Waray, Kapampangan, and Pangasinan.
2. The Solution: The PLTAT App ("Swiss Army Knife" of Data Curation)
To solve the logistical challenge of building native datasets across eight languages, we have developed the PLTAT App. Designed as an all-in-one, crowdsourcing "Swiss Army knife," this platform empowers native speakers, linguists, and AI enthusiasts to actively participate in the model-training pipeline.
The PLTAT App features four core modules:
- Translate: Allows contributors to perform Human-in-the-Loop (HITL) translations of high-value English datasets, ensuring cultural nuance, correct verb-subject-object (VSO) sentence structures, and appropriate morphological alignments are maintained.
- Generate: Enables users to natively author brand new instruction-tuning prompts and responses directly in their regional languages, bypassing the English-translation bias entirely.
- Evaluate: A rating system where contributors can test the current iteration of the fine-tuned Llama 3.1 model, scoring its outputs for fluency, accuracy, and logic. This lays the groundwork for future Reinforcement Learning from Human Feedback (RLHF).
- Correct: Provides a direct interface for users to edit and fix model hallucinations, dialect mixing, or grammatical errors. These corrections are automatically fed back into the dataset repository for the next training run.
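To make the data flow from the four modules concrete, the sketch below shows one way contributions could be stored and turned into training examples. The `Contribution` schema, its field names, and the sample model output are all hypothetical; the PLTAT App's real storage format is not documented here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contribution:
    module: str                         # "translate", "generate", "evaluate", or "correct"
    language: str                       # target language, e.g. "cebuano"
    prompt: str
    model_output: Optional[str] = None  # filled for "evaluate" and "correct"
    human_text: Optional[str] = None    # human translation, response, or fix
    rating: Optional[int] = None        # filled for "evaluate"


def to_sft_example(c: Contribution) -> Optional[dict]:
    """Human-authored text from Translate or Generate becomes an SFT pair."""
    if c.module in {"translate", "generate"} and c.human_text:
        return {"instruction": c.prompt, "output": c.human_text}
    return None


def to_preference_pair(c: Contribution) -> Optional[dict]:
    """A correction yields a (chosen, rejected) pair for preference tuning."""
    if (c.module == "correct" and c.model_output and c.human_text
            and c.model_output != c.human_text):
        return {"prompt": c.prompt, "chosen": c.human_text,
                "rejected": c.model_output}
    return None


correction = Contribution(
    module="correct",
    language="cebuano",
    prompt="Isalin sa Cebuano: Magandang umaga",
    model_output="Maganda nga buntag",  # invented example of a mixed-dialect output
    human_text="Maayong buntag",
)
print(to_preference_pair(correction))
```

Treating each correction as a chosen/rejected pair means the same contributions can later feed preference-based methods without collecting the data a second time.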
3. Long-Term Vision & Institutional Stakeholder Engagement
Building an NMT-parity teacher model is not a problem that can be solved in a single fine-tuning run. It requires a continuous, iterative cycle of model training, evaluation, and data refinement.
While community crowdsourcing is the engine of the PLTAT App, this is a long-term effort that requires the backing of institutional stakeholders.
- We are calling upon academic institutions (linguistics and computer science departments), government bodies (such as DOST and the Komisyon sa Wikang Filipino), and regional NGOs to partner with PLTAT.
- Institutional backing will provide the necessary oversight for linguistic validation, ensure long-term computing resource sustainability, and help integrate these open-source models into public-serving digital infrastructure.
4. Technical Resources & Access
To democratize access to our current progress and facilitate immediate community contribution, we have deployed the PLTAT Chat App and our Ollama Colab Server Notebook. The models currently hosted represent our baseline and will be iteratively updated as the PLTAT App generates higher-quality, authentic datasets.
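For readers who want to test a model served through the Ollama Colab Server Notebook mentioned in the summary, a request to Ollama's `/api/generate` endpoint can be built with the standard library alone. In the sketch below, the base URL and the model name are placeholders assumed for illustration; substitute whatever address and model tag the notebook actually exposes.

```python
import json
import urllib.request


def build_generate_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the generated text from the JSON reply."""
    request = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]


# Placeholder values (assumed): replace with the server's real address and model tag.
BASE_URL = "http://localhost:11434"
MODEL = "example-model"

request = build_generate_request(BASE_URL, MODEL, "Kumusta ka?")
print(request.full_url)
```

With a server running, `generate(BASE_URL, MODEL, "Kumusta ka?")` would return the model's reply as a string. The request is built separately so it can be inspected without network access.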
Conclusion:
By combining the raw power of the Llama 3.1 architecture with the linguistic authenticity of crowdsourced, human-verified data via the PLTAT App, we are establishing a sustainable, long-term pathway toward true AI parity for Philippine languages.