# Product Requirements Document (PRD)

## IRT-Powered Adaptive Question Bank System

**Document Version:** 1.1
**Date:** March 21, 2026 (Updated)
**Product Name:** IRT Bank Soal (Adaptive Question Bank with AI Generation)
**Client:** Sejoli Tryout Multi-Website Platform
**Status:** Draft - Clarifications Incorporated

---

## Changelog

### v1.1 (March 21, 2026)
- Added **AI Generation**: 1 request = 1 question, no approval workflow
- Added **Admin Playground**: Admin can test AI generation without saving to DB
- Updated **Normalization Control**: Optional manual/automatic mode; system handles normalization automatically when sufficient data is available
- Updated **IRT → CTT Rollback**: Historical IRT scores preserved, CTT applied to new sessions only
- Removed **Admin Permissions/Role-based Access**: Not needed (each admin manages one site via WordPress)
- Updated **Custom Dashboards**: Use FastAPI Admin only (no custom dashboards)
- Added **AI Generation Toggle**: Global on/off switch for cost control
- Added **User-level Question Reuse**: Check if student already answered at difficulty level
- Updated **Student UX**: Admin sees internal metrics, students see only primary score
- Added **Data Retention**: Keep all data (no policy yet)
- Added **Reporting Section**: Student performance, item analysis, calibration status, tryout comparison
- Updated **Admin Persona Note**: This project is the backend tool for IRT/CTT calculation; WordPress handles static questions

---
## 1. Product Vision

### 1.1 Vision Statement
To provide an adaptive, intelligent question bank system that integrates seamlessly with Sejoli's existing Excel-based workflow while introducing modern Item Response Theory (IRT) capabilities and AI-powered question generation, enabling more accurate and efficient student assessment.

### 1.1.1 Primary Goals
- **100% Excel Compatibility**: Maintain exact formula compatibility with the client's existing Excel workflow (CTT scoring with p, bobot, NM, NN)
- **Gradual Modernization**: Enable a smooth transition from Classical Test Theory (CTT) to Item Response Theory (IRT)
- **Adaptive Assessment**: Provide Computerized Adaptive Testing (CAT) capabilities for more efficient and accurate measurement
- **AI-Enhanced Content**: Automatically generate question variants (Mudah/Sulit) from base Sedang questions
- **Multi-Site Support**: Single backend serving multiple WordPress-powered educational sites
- **Non-Destructive**: Zero disruption to existing operations; all enhancements are additive

### 1.1.2 Success Metrics
- **Technical**: CTT scores match client Excel 100%; IRT calibration >80% coverage
- **Educational**: 30% reduction in test length with IRT vs CTT; measurement precision SE < 0.5 after 15 items
- **Adoption**: >70% of tryouts use hybrid mode within 3 months; >80% student satisfaction with adaptive mode
- **Efficiency**: 99.9% question reuse rate via AI-generated variants

---
## 2. User Personas

### 2.1 Administrators (School/Guru)
**Profile:** Non-technical education professionals managing tryouts

**Pain Points:**
- Excel-based scoring is manual and time-consuming
- Static questions require constant new content creation
- Difficulty normalization requires manual calculation
- Limited ability to compare student performance across groups

**Needs:**
- Simple, transparent scoring formulas (CTT mode)
- Easy Excel import/export workflow
- Clear visualizations of student performance
- Configurable normalization (static vs dynamic)
- Optional advanced features (IRT) without added complexity

### 2.2 Students
**Profile:** Students taking tryouts for assessment

**Pain Points:**
- Fixed-length tests regardless of ability level
- Question difficulty may not match their skill
- Long testing sessions with low-value questions

**Needs:**
- Adaptive tests that match their ability level
- Shorter, more efficient assessment
- Clear feedback on strengths and weaknesses
- Consistent scoring across attempts

### 2.3 Content Creators
**Profile:** Staff creating and managing question banks

**Pain Points:**
- Creating 3 difficulty variants per question is time-consuming
- Limited question pool for repeated assessments
- Manual categorization of difficulty levels

**Needs:**
- AI-assisted question generation
- Easy difficulty level adjustment
- Reuse of base questions with variant generation
- Bulk question management tools

### 2.4 Technical Administrators
**Profile:** IT staff managing the platform

**Pain Points:**
- Multiple WordPress sites with separate databases
- Difficulty scaling question pools
- Maintenance of complex scoring systems

**Needs:**
- Centralized backend for multiple sites
- Scalable architecture (aaPanel VPS)
- REST API for WordPress integration
- Automated calibration and normalization
- **Note**: Each admin manages static questions within WordPress; this project provides the backend tool for IRT/CTT calculation and dynamic question selection

---
## 3. Functional Requirements

### 3.1 CTT Scoring (Classical Test Theory)
**FR-1.1** System must calculate tingkat kesukaran (p) per question using the exact client Excel formula:
```
p = Σ Benar / Total Peserta
```
**Acceptance Criteria:**
- p-value calculated per question for each tryout
- Values stored in database (items.ctt_p)
- Results match client Excel to 4 decimal places

**FR-1.2** System must calculate bobot (weight) per question:
```
Bobot = 1 - p
```
**Acceptance Criteria:**
- Bobot calculated and stored (items.ctt_bobot)
- Easy questions (p > 0.70) have low bobot (< 0.30)
- Difficult questions (p < 0.30) have high bobot (> 0.70)

**FR-1.3** System must calculate Nilai Mentah (NM) per student:
```
NM = (Total_Bobot_Siswa / Total_Bobot_Max) × 1000
```
**Acceptance Criteria:**
- NM ranges 0-1000
- SUMPRODUCT equivalent implemented correctly
- Results stored per response (user_answers.ctt_nm)

**FR-1.4** System must calculate Nilai Nasional (NN) with normalization:
```
NN = 500 + 100 × ((NM - Rataan) / SB)
```
**Acceptance Criteria:**
- NN normalized to mean=500, SD=100
- Support static (hardcoded rataan/SB) and dynamic (real-time) modes
- NN clipped to 0-1000 range

**FR-1.5** System must categorize question difficulty per CTT standards:
- p < 0.30 → Sukar (Sulit)
- 0.30 ≤ p ≤ 0.70 → Sedang
- p > 0.70 → Mudah

**Acceptance Criteria:**
- Category assigned (items.ctt_category)
- Used for level field (items.level)
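
The four CTT formulas above chain together; a minimal Python sketch of FR-1.1 through FR-1.4, including the 0-1000 clipping (function and variable names are illustrative, not the production schema):

```python
def ctt_scores(answers, key, rataan=500.0, sb=100.0):
    """answers: one response list per student; key: correct options."""
    n_students, n_items = len(answers), len(key)
    # FR-1.1: p = proportion of students answering each item correctly
    p = [sum(a[i] == key[i] for a in answers) / n_students
         for i in range(n_items)]
    # FR-1.2: bobot = 1 - p, so harder items earn more weight
    bobot = [1.0 - pi for pi in p]
    total_bobot_max = sum(bobot)
    results = []
    for a in answers:
        # FR-1.3: sum of bobot over correct answers, scaled to 0-1000
        earned = sum(b for b, k, r in zip(bobot, key, a) if r == k)
        nm = earned / total_bobot_max * 1000.0
        # FR-1.4: normalize to mean 500, SD 100, then clip to 0-1000
        nn = 500.0 + 100.0 * (nm - rataan) / sb
        results.append({"nm": nm, "nn": max(0.0, min(1000.0, nn))})
    return p, bobot, results
```

In static mode rataan/sb would come from tryout_config; in dynamic mode from tryout_stats.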

### 3.2 IRT Scoring (Item Response Theory)
**FR-2.1** System must implement the 1PL Rasch model:
```
P(θ) = 1 / (1 + e^-(θ - b))
```
**Acceptance Criteria:**
- θ (ability) estimated per student
- b (difficulty) calibrated per question
- Ranges: θ, b ∈ [-3, +3]

**FR-2.2** System must estimate θ using Maximum Likelihood Estimation (MLE)

**Acceptance Criteria:**
- Initial guess θ = 0
- Optimization bounds [-3, +3]
- Standard error (SE) calculated using Fisher information
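
A Newton-Raphson sketch of MLE under these criteria (θ starts at 0, bounded to [-3, +3], SE from Fisher information); names are illustrative, not the production API:

```python
import math

def estimate_theta(responses, max_iter=50, tol=1e-6):
    """responses: (b, u) pairs, u = 1 for correct, 0 for incorrect."""
    theta = 0.0                                        # initial guess
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b, _ in responses]
        # 1PL log-likelihood gradient and Fisher information
        grad = sum(u - p for (_, u), p in zip(responses, probs))
        info = sum(p * (1.0 - p) for p in probs)
        if info == 0.0:
            break
        step = grad / info                             # Newton-Raphson step
        theta = max(-3.0, min(3.0, theta + step))      # enforce bounds
        if abs(step) < tol:
            break
    # SE = 1 / sqrt(information) at the final theta
    info = sum(p * (1.0 - p)
               for p in (1.0 / (1.0 + math.exp(-(theta - b)))
                         for b, _ in responses))
    se = 1.0 / math.sqrt(info) if info > 0 else float("inf")
    return theta, se
```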

**FR-2.3** System must calibrate b parameters from response data

**Acceptance Criteria:**
- Minimum 100-500 responses per item for calibration
- Calibration status tracked (items.calibrated)
- Auto-convert CTT p to an initial b: `b ≈ ln((1-p)/p)` (hard items, with low p, get positive b; easy items get negative b, consistent with the b ranges in §13.2)

**FR-2.4** System must map θ to NN for CTT comparison

**Acceptance Criteria:**
- θ ∈ [-3, +3] mapped to NN ∈ [0, 1000]
- Formula: `NN = 500 + (θ / 3) × 500`
- Secondary score returned in API responses
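
Both conversions are small enough to sketch directly. Here b is written as `ln((1-p)/p)` so that hard items (low p) map to positive difficulty, matching the category ranges in Appendix 13.2; the clamp on p and the helper names are illustrative assumptions:

```python
import math

def p_to_b(p):
    """Seed an initial IRT difficulty from a CTT p-value (FR-2.3)."""
    p = min(max(p, 0.01), 0.99)        # guard the log near p = 0 or 1
    b = math.log((1.0 - p) / p)        # low p (hard) -> positive b
    return max(-3.0, min(3.0, b))      # clip to the working range

def theta_to_nn(theta):
    """Map theta in [-3, +3] linearly onto NN in [0, 1000] (FR-2.4)."""
    return 500.0 + (theta / 3.0) * 500.0
```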

### 3.3 Hybrid Mode
**FR-3.1** System must support dual scoring (CTT + IRT in parallel)

**Acceptance Criteria:**
- Both scores calculated per response
- Primary and secondary scores returned
- Admin can choose which to display

**FR-3.2** System must support hybrid item selection

**Acceptance Criteria:**
- First N items: fixed order (CTT mode)
- Remaining items: adaptive (IRT mode)
- Configurable transition point (tryout_config.hybrid_transition_slot)

**FR-3.3** System must support hybrid normalization

**Acceptance Criteria:**
- Static mode for small samples (< threshold)
- Dynamic mode for large samples (≥ threshold)
- Configurable threshold (tryout_config.min_sample_for_dynamic)
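
The two hybrid decisions above reduce to simple predicates; a sketch in which the dict keys mirror the tryout_config/tryout_stats fields named in this document, while the functions themselves are illustrative:

```python
def selection_method(slot, transition_slot):
    """FR-3.2: fixed order up to the transition slot, adaptive after it."""
    return "fixed" if slot <= transition_slot else "adaptive"

def normalization_params(config, stats):
    """FR-3.3: static rataan/SB for small samples, dynamic once the
    participant count reaches min_sample_for_dynamic."""
    if stats["participant_count"] >= config["min_sample_for_dynamic"]:
        return stats["current_rataan"], stats["current_sb"]
    return config["static_rataan"], config["static_sb"]
```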

### 3.4 Dynamic Normalization
**FR-4.1** System must maintain running statistics per tryout

**Acceptance Criteria:**
- Track: participant_count, total_nm_sum, total_nm_sq_sum
- Update on each completed session
- Stored in tryout_stats table

**FR-4.2** System must calculate real-time rataan and SB

**Acceptance Criteria:**
- Rataan = mean(all NM)
- SB = sqrt(variance(all NM))
- Updated incrementally (no full recalc)
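
The running sums named in FR-4.1 are enough to recover rataan and SB without a full recalculation; a sketch where the field names follow the tryout_stats table and the function name is illustrative:

```python
import math

def update_tryout_stats(stats, nm):
    """Fold one completed session's NM into the running statistics."""
    stats["participant_count"] += 1
    stats["total_nm_sum"] += nm
    stats["total_nm_sq_sum"] += nm * nm
    n = stats["participant_count"]
    mean = stats["total_nm_sum"] / n
    # Population variance from running sums: E[x^2] - (E[x])^2.
    # max(0, ...) guards tiny negative values from floating-point error.
    variance = max(0.0, stats["total_nm_sq_sum"] / n - mean * mean)
    stats["current_rataan"] = mean
    stats["current_sb"] = math.sqrt(variance)
    return stats
```

For very large participant counts, Welford's online algorithm is the numerically safer variant of the same incremental idea.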

**FR-4.3** System must support optional normalization control (manual vs automatic)

**Acceptance Criteria:**
- Admin can choose manual mode (static normalization with hardcoded values)
- Admin can choose automatic mode (dynamic normalization when sufficient data is available)
- When automatic is selected and sufficient data is reached, the system handles normalization automatically
- Configurable threshold: min_sample_for_dynamic (default: 100)
- Admin can switch between manual and automatic at any time
- System displays current data readiness (participant count vs threshold)

### 3.5 AI Question Generation
**FR-5.1** System must generate question variants via the OpenRouter API

**Acceptance Criteria:**
- Generate Mudah variant from a Sedang base
- Generate Sulit variant from a Sedang base
- Generate same-level variant from a Sedang base
- Use Qwen3 Coder 480B or Llama 3.3 70B
- **1 request = 1 question** (not batch generation)

**FR-5.2** System must use a standardized prompt template

**Acceptance Criteria:**
- Include context (tryout_id, slot, level)
- Include the base question (basis soal) for reference (provides topic/context)
- Request 1 question with 4 options
- Include an explanation
- Maintain the same context, varying only the difficulty level

**FR-5.3** System must implement question reuse/caching with user-level tracking

**Acceptance Criteria:**
- Check DB for an existing variant before generating
- Check if the student (user_id) already answered the question at the specific difficulty level
- Reuse if found (same tryout_id, slot, level)
- Generate only on a cache miss OR when the user has already answered the cached variant at this difficulty
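
One way to read this reuse policy as a single predicate: reuse a cached variant unless the student has already answered it, and never generate while the global toggle (FR-5.6) is off. Set-based lookups stand in for the real DB queries; all names here are illustrative:

```python
def should_generate(cached_variants, answered, tryout_id, slot, level,
                    user_id, ai_enabled=True):
    """cached_variants: set of (tryout_id, slot, level) already in the DB.
    answered: set of (user_id, slot, level) this user has answered."""
    if not ai_enabled:
        return False   # toggle off: always reuse, regardless of repetition
    cache_miss = (tryout_id, slot, level) not in cached_variants
    seen_before = (user_id, slot, level) in answered
    return cache_miss or seen_before
```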

**FR-5.4** System must provide an admin playground for AI testing

**Acceptance Criteria:**
- Admin can request AI generation without saving to the database
- Admin can re-request unlimited times until satisfied (no approval workflow)
- Preview mode shows the generated question before saving
- Admin can edit content before saving
- Purpose: build admin trust in AI quality before enabling it for students

**FR-5.5** System must parse and store AI-generated questions

**Acceptance Criteria:**
- Parse stem, options, correct answer, explanation
- Store in items table with generated_by='ai'
- Link to basis_item_id
- No approval workflow required for student tests

**FR-5.6** System must support an AI generation toggle

**Acceptance Criteria:**
- Global toggle to enable/disable AI generation (config.AI_generation_enabled)
- When disabled: reuse DB questions regardless of repetition
- When enabled: generate new variants on cache miss
- Admin can toggle on/off based on cost/budget

### 3.6 Item Selection
**FR-6.1** System must support fixed-order selection (CTT mode)

**Acceptance Criteria:**
- Items delivered in slot order (1, 2, 3, ...)
- No adaptive logic
- Used when selection_mode='fixed'

**FR-6.2** System must support adaptive selection (IRT mode)

**Acceptance Criteria:**
- Select the item where b ≈ current θ
- Prioritize calibrated items
- Use item information to maximize precision
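
For the 1PL model, item information P(1-P) peaks exactly when b = θ, so "b ≈ current θ" and "maximize information" are the same rule; a sketch in which the item dicts are illustrative, not the production schema:

```python
import math

def item_information(theta, b):
    """Fisher information of a 1PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def next_item(theta, items, answered_ids):
    """Pick the most informative unanswered item, preferring calibrated
    ones; fall back to uncalibrated items only if none are calibrated."""
    pool = [it for it in items if it["id"] not in answered_ids]
    calibrated = [it for it in pool if it["calibrated"]] or pool
    return max(calibrated, key=lambda it: item_information(theta, it["irt_b"]))
```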

**FR-6.3** System must support level-based selection (hybrid mode)

**Acceptance Criteria:**
- Select from the specified level (Mudah/Sedang/Sulit)
- Check if the level variant exists in the DB
- Generate via AI if it does not exist

### 3.7 Excel Import
**FR-7.1** System must import from the client Excel format

**Acceptance Criteria:**
- Parse answer key (Row 2, KUNCI)
- Extract calculated p-values (Row 4, data_only=True)
- Extract bobot values (Row 5)
- Import student responses (Row 6+)
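
A sketch of this row layout, written against plain row tuples such as those produced by openpyxl's `ws.iter_rows(values_only=True)` on a workbook opened with `data_only=True`; the assumption that column A holds labels/student IDs is illustrative:

```python
def parse_tryout_sheet(rows):
    """rows: sequence of row tuples; 1-based sheet rows map to 0-based
    indices here (Row 2 -> rows[1], and so on)."""
    kunci = list(rows[1][1:])                  # Row 2: answer key (KUNCI)
    p_values = list(rows[3][1:])               # Row 4: calculated p-values
    bobot = list(rows[4][1:])                  # Row 5: bobot values
    responses = {r[0]: list(r[1:]) for r in rows[5:]}  # Row 6+: per student
    return {"kunci": kunci, "p": p_values, "bobot": bobot,
            "responses": responses}
```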

**FR-7.2** System must create items from Excel import

**Acceptance Criteria:**
- Create an item per question slot
- Set ctt_p, ctt_bobot, ctt_category
- Auto-calculate irt_b from ctt_p
- Set calibrated=False

**FR-7.3** System must configure the tryout from Excel import

**Acceptance Criteria:**
- Create tryout_config with CTT settings
- Set normalization_mode='static' (default)
- Set static_rataan=500, static_sb=100

### 3.8 API Endpoints
**FR-8.1** System must provide a Next Item endpoint

**Acceptance Criteria:**
- POST /api/v1/session/{session_id}/next_item
- Accept mode (ctt/irt/hybrid)
- Accept current_responses array
- Return item with selection_method metadata

**FR-8.2** System must provide a Complete Session endpoint

**Acceptance Criteria:**
- POST /api/v1/session/{session_id}/complete
- Return primary_score (CTT or IRT)
- Return secondary_score (parallel calculation)
- Return comparison (NN difference, agreement)
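
The comparison payload can be as small as a difference plus an agreement flag; the 20-point threshold below mirrors the success criterion in §8.1, and the function name is illustrative:

```python
def score_comparison(ctt_nn, irt_nn, threshold=20.0):
    """Build the FR-8.2 comparison block from the two NN scores."""
    diff = abs(ctt_nn - irt_nn)
    return {"nn_difference": round(diff, 2), "agreement": diff < threshold}
```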

**FR-8.3** System must provide a Get Tryout Config endpoint

**Acceptance Criteria:**
- GET /api/v1/tryout/{tryout_id}/config
- Return scoring_mode, normalization_mode
- Return current_stats (participant_count, rataan, SB)
- Return calibration_status

**FR-8.4** System must provide an Update Normalization endpoint

**Acceptance Criteria:**
- PUT /api/v1/tryout/{tryout_id}/normalization
- Accept normalization_mode updates
- Accept static_rataan, static_sb overrides
- Return the will_switch_to_dynamic_at threshold

### 3.9 Multi-Site Support
**FR-9.1** System must support multiple WordPress sites

**Acceptance Criteria:**
- Each site has a unique website_id
- Shared backend, isolated data per site
- API responses scoped to website_id

**FR-9.2** System must support per-site configuration

**Acceptance Criteria:**
- Each (website_id, tryout_id) pair is unique
- Independent tryout_config per tryout
- Independent tryout_stats per tryout

---
## 4. Non-Functional Requirements

### 4.1 Performance
**NFR-4.1.1** Next Item API response time < 500ms
**NFR-4.1.2** Complete Session API response time < 2s
**NFR-4.1.3** AI question generation < 10s (OpenRouter timeout)
**NFR-4.1.4** Support 1000 concurrent students

### 4.2 Scalability
**NFR-4.2.1** Support 10,000+ items in the database
**NFR-4.2.2** Support 100,000+ student responses
**NFR-4.2.3** Question reuse: 99.9% cache hit rate after initial generation
**NFR-4.2.4** Horizontal scaling via PostgreSQL read replicas

### 4.3 Reliability
**NFR-4.3.1** 99.9% uptime during tryout periods
**NFR-4.3.2** Automatic fallback to CTT if IRT fails
**NFR-4.3.3** Database transaction consistency
**NFR-4.3.4** Graceful degradation if the AI API is unavailable

### 4.4 Security
**NFR-4.4.1** API authentication via WordPress tokens
**NFR-4.4.2** website_id isolation (no cross-site data access)
**NFR-4.4.3** Rate limiting per API key
**NFR-4.4.4** Audit trail for all scoring changes

### 4.5 Compatibility
**NFR-4.5.1** 100% formula match with the client Excel
**NFR-4.5.2** Non-destructive: zero data loss during transitions
**NFR-4.5.3** Reversible: IRT features can be disabled at any time
**NFR-4.5.4** WordPress REST API integration

### 4.6 Maintainability
**NFR-4.6.1** FastAPI Admin auto-generated UI for CRUD
**NFR-4.6.2** Alembic migrations for schema changes
**NFR-4.6.3** Comprehensive API documentation (OpenAPI)
**NFR-4.6.4** Logging for debugging scoring calculations

---
## 5. Data Requirements

### 5.1 Core Entities

#### Items
- **id**: Primary key
- **website_id, tryout_id**: Composite key for multi-site
- **slot, level**: Position and difficulty
- **stem, options, correct, explanation**: Question content
- **ctt_p, ctt_bobot, ctt_category**: CTT parameters
- **irt_b, irt_a, irt_c**: IRT parameters
- **calibrated, calibration_sample_size**: Calibration status
- **generated_by, ai_model, basis_item_id**: AI generation metadata

#### User Answers
- **id**: Primary key
- **wp_user_id, website_id, tryout_id, slot, level**: Composite key
- **item_id, response**: Question and answer
- **ctt_bobot_earned, ctt_total_bobot_cumulative, ctt_nm, ctt_nn**: CTT scores
- **rataan_used, sb_used, normalization_mode_used**: Normalization metadata
- **irt_theta, irt_theta_se, irt_information**: IRT scores
- **scoring_mode_used**: Which scoring mode was used

#### Tryout Config
- **id**: Primary key
- **website_id, tryout_id**: Composite key
- **scoring_mode**: 'ctt', 'irt', 'hybrid'
- **selection_mode**: 'fixed', 'adaptive', 'hybrid'
- **normalization_mode**: 'static', 'dynamic', 'hybrid'
- **static_rataan, static_sb, min_sample_for_dynamic**: Normalization settings
- **min_calibration_sample, theta_estimation_method**: IRT settings
- **hybrid_transition_slot, fallback_to_ctt_on_error**: Transition settings

#### Tryout Stats
- **id**: Primary key
- **website_id, tryout_id**: Composite key
- **participant_count**: Number of completed sessions
- **total_nm_sum, total_nm_sq_sum**: Running sums for mean/SD calculation
- **current_rataan, current_sb**: Calculated values
- **min_nm, max_nm**: Score range
- **last_calculated_at, last_participant_id**: Metadata

### 5.2 Data Relationships
- Items → User Answers (1:N, CASCADE delete)
- Items → Items (self-reference via basis_item_id for AI generation)
- Tryout Config → User Answers (1:N via website_id, tryout_id)
- Tryout Stats → User Answers (1:N via website_id, tryout_id)

---
## 6. Technical Constraints

### 6.1 Tech Stack (Fixed)
- **Backend**: FastAPI (Python)
- **Database**: PostgreSQL (via aaPanel PgSQL Manager)
- **ORM**: SQLAlchemy
- **Admin**: FastAPI Admin
- **AI**: OpenRouter API (Qwen3 Coder 480B, Llama 3.3 70B)
- **Deployment**: aaPanel VPS (Python Manager)

### 6.2 External Dependencies
- **OpenRouter API**: Must handle rate limits, timeouts, errors
- **WordPress**: REST API integration, authentication
- **Excel**: openpyxl for import, pandas for data processing

### 6.3 Mathematical Constraints
- **CTT**: Must use the EXACT client formulas (p, bobot, NM, NN)
- **IRT**: 1PL Rasch model only (no a, c parameters initially)
- **Normalization**: Mean=500, SD=100 target
- **Ranges**: θ, b ∈ [-3, +3]; NM, NN ∈ [0, 1000]

---
## 7. User Stories

### 7.1 Administrator Stories
**US-7.1.1** As an administrator, I want to import questions from Excel so that I can migrate existing content without manual entry.
- Priority: High
- Acceptance: FR-7.1, FR-7.2, FR-7.3

**US-7.1.2** As an administrator, I want to configure the normalization mode (static/dynamic/hybrid) so that I can control how scores are normalized.
- Priority: High
- Acceptance: FR-4.3, FR-8.4

**US-7.1.3** As an administrator, I want to view calibration status so that I know when IRT is ready for production.
- Priority: Medium
- Acceptance: FR-8.3

**US-7.1.4** As an administrator, I want to choose the scoring mode (CTT/IRT/hybrid) so that I can gradually adopt advanced features.
- Priority: High
- Acceptance: FR-3.1, FR-3.2, FR-3.3

### 7.2 Student Stories
**US-7.2.1** As a student, I want to take adaptive tests so that I get questions matching my ability level.
- Priority: High
- Acceptance: FR-6.2, FR-2.1, FR-2.2

**US-7.2.2** As a student, I want to see my normalized score (NN) so that I can compare my performance with others.
- Priority: High
- Acceptance: FR-1.4, FR-4.2

**US-7.2.3** As a student, I want a seamless experience in which any technical issues (IRT fallback, AI generation failures) are handled without interrupting my test.
- Priority: High
- Acceptance: Seamless fallback (students unaware of internal mode switching); no error messages visible to students

### 7.3 Content Creator Stories
**US-7.3.1** As a content creator, I want to generate question variants via AI so that I don't have to manually create 3 difficulty levels.
- Priority: High
- Acceptance: FR-5.1, FR-5.2, FR-5.3, FR-5.4

**US-7.3.2** As a content creator, I want to reuse existing questions at different difficulty levels so that I can maximize question pool efficiency.
- Priority: Medium
- Acceptance: FR-5.3, FR-6.3

### 7.4 Technical Administrator Stories
**US-7.4.1** As a technical administrator, I want to manage multiple WordPress sites from one backend so that I don't have to duplicate infrastructure.
- Priority: High
- Acceptance: FR-9.1, FR-9.2

**US-7.4.2** As a technical administrator, I want to monitor calibration progress so that I can plan the IRT rollout.
- Priority: Medium
- Acceptance: FR-2.3, FR-8.3

**US-7.4.3** As a technical administrator, I want access to internal scoring details (CTT vs IRT comparison, normalization metrics) for debugging and monitoring, while students only see primary scores.
- Priority: Medium
- Acceptance: Admin visibility of all internal metrics; student visibility limited to the final NN score only

---
## 8. Success Criteria

### 8.1 Technical Validation
- ✅ CTT scores match the client Excel to 4 decimal places (100% formula accuracy)
- ✅ Dynamic normalization produces mean=500±5, SD=100±5 after 100 users
- ✅ IRT calibration covers >80% of items with 500+ responses per item
- ✅ CTT vs IRT NN difference <20 points (moderate agreement)
- ✅ Fallback rate <5% (IRT → CTT on error)

### 8.2 Educational Validation
- ✅ IRT measurement precision: SE <0.5 after 15 items
- ✅ Normalization quality: distribution skewness <0.5
- ✅ Adaptive efficiency: 30% reduction in test length (15 IRT items ≈ 30 CTT items for the same precision)
- ✅ Student satisfaction: >80% prefer adaptive mode in surveys
- ✅ Admin adoption: >70% of tryouts use hybrid mode within 3 months

### 8.3 Business Validation
- ✅ Zero data loss during the CTT→IRT transition
- ✅ Reversible: can disable IRT and revert to CTT at any time
- ✅ Non-destructive: the existing Excel workflow remains functional
- ✅ Cost efficiency: 99.9% question reuse vs 90,000 unique questions for 1000 users
- ✅ Multi-site scalability: one backend supports unlimited WordPress sites

---
## 9. Risk Mitigation

### 9.1 Technical Risks
| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| IRT calibration fails (insufficient data) | High | Medium | Fall back to CTT mode, enable hybrid transition |
| OpenRouter API down/unavailable | Medium | Low | Cache questions, serve static variants |
| Excel formula mismatch | High | Low | Unit tests with client Excel data |
| Database performance degradation | Medium | Low | Indexing, read replicas, query optimization |

### 9.2 Business Risks
| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| Administrators refuse to use IRT (too complex) | High | Medium | Hybrid mode with CTT-first UI |
| Students dislike adaptive tests | Medium | Low | A/B testing, optional mode |
| Excel workflow changes (client updates) | High | Low | Version control, flexible import parser |
| Multi-site data isolation failure | Critical | Low | website_id validation enforced on every query |

---
## 10. Migration Strategy

### 10.1 Phase 1: Import Existing Data (Week 1)
- Export current Sejoli Tryout data to Excel
- Run the import script to load items and configurations
- Configure CTT mode with static normalization
- Validate: CTT scores match Excel 100%

### 10.2 Phase 2: Collect Calibration Data (Weeks 2-4)
- Students use tryouts normally (CTT mode)
- Backend logs all responses
- Monitor calibration progress (items.calibrated status)
- Collect running statistics (tryout_stats)

### 10.3 Phase 3: Enable Dynamic Normalization (Week 5)
- Check participant count ≥ 100
- Update normalization_mode='hybrid'
- Test with 10-20 new students
- Verify: normalized distribution has mean≈500, SD≈100

### 10.4 Phase 4: Enable IRT Adaptive (Week 6+)
- After ≥90% of items are calibrated and 1000+ responses collected
- Update scoring_mode='irt', selection_mode='adaptive'
- Enable AI generation for Mudah/Sulit variants
- Monitor fallback rate and measurement precision

### 10.5 Rollback Plan
- Any phase is reversible
- Revert to CTT mode if IRT issues occur
- **Score preservation**: historical IRT scores are kept as-is; CTT applies only to new sessions after rollback
- Disable AI generation if costs run too high
- Revert to static normalization if dynamic proves unstable

---
## 11. Future Enhancements

### 11.1 Short-term (3-6 months)
- **2PL/3PL IRT**: Add discrimination (a) and guessing (c) parameters
- **Item Response Categorization**: Bloom's Taxonomy, cognitive domains
- **Advanced AI Models**: Fine-tune models for specific subjects
- **Data Retention Policy**: Define an archival and anonymization strategy (currently: keep all data)

### 11.2 Long-term (6-12 months)
- **Multi-dimensional IRT**: Measure multiple skills per question
- **Automatic Item Difficulty Adjustment**: AI calibrates b parameters
- **Predictive Analytics**: Student performance forecasting
- **Integration with LMS**: Moodle, Canvas API support

---
## 12. Glossary

| Term | Definition |
|------|------------|
| **p (TK)** | Proportion correct / Tingkat Kesukaran (CTT difficulty) |
| **Bobot** | 1 - p weight (CTT scoring weight) |
| **NM** | Nilai Mentah (raw score, 0-1000) |
| **NN** | Nilai Nasional (normalized, 500±100) |
| **Rataan** | Mean of NM scores |
| **SB** | Simpangan Baku (standard deviation of NM) |
| **θ (theta)** | IRT ability (-3 to +3) |
| **b** | IRT difficulty (-3 to +3) |
| **SE** | Standard error (precision) |
| **CAT** | Computerized Adaptive Testing |
| **MLE** | Maximum Likelihood Estimation |
| **CTT** | Classical Test Theory |
| **IRT** | Item Response Theory |

---
## 13. Appendices

### 13.1 Formula Reference
- **CTT p**: `p = Σ Benar / Total Peserta`
- **CTT Bobot**: `Bobot = 1 - p`
- **CTT NM**: `NM = (Total_Bobot_Siswa / Total_Bobot_Max) × 1000`
- **CTT NN**: `NN = 500 + 100 × ((NM - Rataan) / SB)`
- **IRT 1PL**: `P(θ) = 1 / (1 + e^-(θ - b))`
- **CTT→IRT conversion**: `b ≈ ln((1-p)/p)`
- **θ→NN mapping**: `NN = 500 + (θ / 3) × 500`

### 13.2 Difficulty Categories
| CTT p | CTT Category | Level | IRT b Range |
|-------|--------------|-------|-------------|
| p < 0.30 | Sukar | Sulit | b > 0.85 |
| 0.30 ≤ p ≤ 0.70 | Sedang | Sedang | -0.85 ≤ b ≤ 0.85 |
| p > 0.70 | Mudah | Mudah | b < -0.85 |

### 13.3 API Quick Reference
- `POST /api/v1/session/{session_id}/next_item` - Get next question
- `POST /api/v1/session/{session_id}/complete` - Submit and score
- `GET /api/v1/tryout/{tryout_id}/config` - Get configuration
- `PUT /api/v1/tryout/{tryout_id}/normalization` - Update normalization

---
## 14. Reporting Requirements

### 14.1 Student Performance Reports
**FR-14.1.1** System must provide individual student performance reports

**Acceptance Criteria:**
- Report all student sessions (CTT, IRT, hybrid)
- Include NM and NN scores per session
- Include time spent per question
- Include total_benar, total_bobot_earned
- Export to CSV/Excel

**FR-14.1.2** System must provide aggregate student performance reports

**Acceptance Criteria:**
- Group by tryout, website_id, date range
- Show average NM, NN, theta per group
- Show distribution (min, max, median, std dev)
- Show pass/fail rates
- Export to CSV/Excel

### 14.2 Item Analysis Reports
**FR-14.2.1** System must provide item difficulty reports

**Acceptance Criteria:**
- Show CTT p-value per item
- Show IRT b-parameter per item
- Show calibration status
- Show discrimination index (if available)
- Filter by difficulty category (Mudah/Sedang/Sulit)

**FR-14.2.2** System must provide item information function reports

**Acceptance Criteria:**
- Show item information values at different theta levels
- Visualize item characteristic curves (optional)
- Show the optimal theta range for each item

### 14.3 Calibration Status Reports
**FR-14.3.1** System must provide calibration progress reports

**Acceptance Criteria:**
- Show total items per tryout
- Show calibrated item count and percentage
- Show items awaiting calibration
- Show average calibration sample size
- Show estimated time to reach the calibration threshold
- Highlight ready-for-IRT-rollout status (≥90% calibrated)

### 14.4 Tryout Comparison Reports
**FR-14.4.1** System must provide tryout comparisons across dates

**Acceptance Criteria:**
- Compare NM/NN distributions across different tryout dates
- Show trends over time (e.g., monthly averages)
- Show the impact of normalization changes (static → dynamic)

**FR-14.4.2** System must provide tryout comparisons across subjects

**Acceptance Criteria:**
- Compare performance across different subjects (e.g., Mat SD vs Bahasa SMA)
- Show subject-specific calibration status
- Show IRT accuracy differences per subject

### 14.5 Reporting Infrastructure
**FR-14.5.1** System must provide report scheduling

**Acceptance Criteria:**
- Admin can schedule daily/weekly/monthly reports
- Reports emailed to admins on schedule
- Report templates configurable (e.g., calibration status every Monday)

**FR-14.5.2** System must provide report export formats

**Acceptance Criteria:**
- Export to CSV
- Export to Excel (.xlsx)
- Export to PDF (with charts if available)

---

**Document End**

**Document Version:** 1.1
**Created:** March 21, 2026
**Updated:** March 21, 2026 (Clarifications Incorporated)
**Author:** Product Team (based on Technical Specification v1.2.0)
**Status:** Draft - Clarifications Incorporated