Перейти к основному содержимому
Перейти к основному содержимому

Error 439: CANNOT_SCHEDULE_TASK

подсказка

This error occurs when ClickHouse cannot allocate a new thread from the thread pool to execute a task. It indicates that either the thread pool is exhausted (all threads busy), system thread limit is reached, or the OS cannot create new threads.

Most common causes

  1. Thread pool exhausted

    • All threads in pool busy with active tasks
    • Too many concurrent queries requesting threads
    • Thread pool size limit reached (threads = max pool size)
    • Jobs queued waiting for available threads
  2. System thread limit reached

    • OS kernel thread limit exceeded
    • ulimit -u (max user processes) reached
    • System-wide thread limit hit
    • Container or cgroup thread limit reached
  3. High query concurrency with max_threads settings

    • Many queries each requesting max_threads threads
    • max_insert_threads setting too high with many concurrent inserts
    • Thread demand exceeds available thread pool capacity
    • Spike in concurrent query workload
  4. Resource exhaustion

    • System cannot allocate memory for new threads
    • Out of memory for thread stack allocation
    • System resource limits preventing thread creation
    • Container memory limits affecting thread creation
  5. Misconfigured thread pool settings

    • max_thread_pool_size set too low for workload
    • Thread pool not properly sized for concurrent queries
    • Imbalance between query concurrency and thread availability

Common solutions

1. Check current thread usage

-- View current thread pool status
SELECT 
    metric,
    value
FROM system.metrics
WHERE metric LIKE '%Thread%'
ORDER BY metric;

-- Key metrics to check:
-- QueryPipelineExecutorThreads - active query execution threads
-- QueryPipelineExecutorThreadsActive - threads currently executing
-- GlobalThread - total threads in global pool

2. Check thread pool configuration

-- View thread pool settings
SELECT 
    name,
    value,
    description
FROM system.server_settings
WHERE name LIKE '%thread%'
ORDER BY name;

-- Key settings:
-- max_thread_pool_size - maximum threads in global pool
-- max_thread_pool_free_size - idle threads kept in pool
-- thread_pool_queue_size - max tasks waiting in queue

3. Reduce per-query thread usage

-- Limit threads for specific query
SELECT * FROM large_table
SETTINGS max_threads = 4;

-- Reduce insert threads
INSERT INTO table
SETTINGS max_insert_threads = 4;

-- Set user-level defaults
ALTER USER your_user SETTINGS max_threads = 8;

4. Check system thread limits

# Check current thread limits
ulimit -u

# Check system-wide limits
cat /proc/sys/kernel/threads-max
cat /proc/sys/kernel/pid_max

# Check current thread count
ps -eLf | wc -l

# For containers, check cgroup limits
cat /sys/fs/cgroup/pids/pids.max

5. Enable concurrency control (if available)

-- Check concurrency control settings
SELECT 
    name,
    value
FROM system.server_settings
WHERE name LIKE '%concurrent_threads_soft_limit%';

-- concurrent_threads_soft_limit_ratio_to_cores - limits threads per core
-- concurrent_threads_soft_limit_num - absolute thread limit
примечание

Concurrency control was broken in versions before October 2024 fix. Fixed properly in 24.10+.

6. Monitor and limit concurrent queries

-- Check concurrent query count
SELECT count() AS concurrent_queries
FROM system.processes;

-- Limit concurrent queries per user
ALTER USER your_user SETTINGS max_concurrent_queries_for_user = 10;

-- Check thread usage per query
SELECT 
    query_id,
    user,
    ProfileEvents['QueryPipelineExecutorThreads'] AS threads_used,
    query
FROM system.processes
ORDER BY threads_used DESC;

Common scenarios

Scenario 1: No free threads in pool

Error: Cannot schedule a task: no free thread (timeout=0) 
(threads=15000, jobs=15000)

Cause: Thread pool completely saturated; all 15000 threads busy.

Solution:

Scenario 2: Failed to start thread

Error: Cannot schedule a task: failed to start the thread 
(threads=14755, jobs=14754)

Cause: System unable to create new thread (OS or resource limit).

Solution:

# Increase system limits
ulimit -u 65535

# Or in /etc/security/limits.conf
* soft nproc 65535
* hard nproc 65535

# Increase kernel limits
sysctl -w kernel.threads-max=100000
sysctl -w kernel.pid_max=100000

Scenario 3: Cannot allocate thread

Error: Cannot schedule a task: cannot allocate thread

Cause: Memory or system resources insufficient for thread creation.

Solution:

  • Check available memory: free -h
  • Check if system is swapping: vmstat 1
  • Reduce concurrent query load
  • Increase system memory or reduce thread usage

Scenario 4: Insert spike with max_insert_threads

Error: CANNOT_SCHEDULE_TASK during high insert load

Cause: Many concurrent inserts each using high max_insert_threads.

Solution:

-- Reduce insert threads globally
SET max_insert_threads = 4;

-- For specific insert
INSERT INTO table
SETTINGS max_insert_threads = 2;

-- Use async inserts to batch operations
SET async_insert = 1;

Scenario 5: Query spike exhausting thread pool

Error appears during traffic spike
Multiple queries failing simultaneously

Cause: Sudden increase in concurrent queries.

Solution:

  • Implement query queuing or rate limiting on client side
  • Reduce max_threads per query
  • Increase max_thread_pool_size (if system allows)
  • Scale horizontally (add more replicas)

Prevention tips

  1. Set reasonable thread limits: Don't use excessively high max_threads values
  2. Monitor thread usage: Track thread pool metrics regularly
  3. Configure system limits: Ensure OS limits are appropriate for workload
  4. Use async inserts: Reduce thread usage for insert workloads
  5. Implement rate limiting: Control concurrent query load
  6. Scale horizontally: Add replicas to distribute thread demand
  7. Optimize queries: Efficient queries need fewer threads and complete faster

Debugging steps

  1. Check recent CANNOT_SCHEDULE_TASK errors:

    SELECT 
        event_time,
        query_id,
        user,
        exception,
        query
    FROM system.query_log
    WHERE exception_code = 439
        AND event_date >= today() - 1
    ORDER BY event_time DESC
    LIMIT 20;
    
  2. Monitor thread pool metrics:

    SELECT 
        event_time,
        CurrentMetric_GlobalThread AS global_threads,
        CurrentMetric_QueryPipelineExecutorThreads AS executor_threads,
        CurrentMetric_QueryPipelineExecutorThreadsActive AS active_threads,
        CurrentMetric_Query AS concurrent_queries
    FROM system.metric_log
    WHERE event_time >= now() - INTERVAL 1 HOUR
    ORDER BY event_time DESC
    LIMIT 100;
    
  3. Check concurrent query patterns:

    SELECT 
        toStartOfMinute(event_time) AS minute,
        count() AS query_count,
        countIf(exception_code = 439) AS thread_errors,
        avg(ProfileEvents['QueryPipelineExecutorThreads']) AS avg_threads
    FROM system.query_log
    WHERE event_time >= now() - INTERVAL 1 HOUR
    GROUP BY minute
    ORDER BY minute DESC;
    
  4. Identify high thread-consuming queries:

    SELECT 
        query_id,
        user,
        ProfileEvents['QueryPipelineExecutorThreads'] AS threads,
        ProfileEvents['QueryPipelineExecutorThreadsActive'] AS active_threads,
        normalizeQuery(query) AS query_pattern
    FROM system.query_log
    WHERE event_time >= now() - INTERVAL 1 HOUR
        AND type = 'QueryFinish'
    ORDER BY threads DESC
    LIMIT 20;
    
  5. Check system thread limits:

    # Check user process limit
    ulimit -u
    
    # Check current thread count
    ps -eLf | wc -l
    
    # Check system limits
    cat /proc/sys/kernel/threads-max
    cat /proc/sys/kernel/pid_max
    
    # For containers
    cat /sys/fs/cgroup/pids/pids.current
    cat /sys/fs/cgroup/pids/pids.max
    
  6. Review thread pool configuration:

    SELECT 
        name,
        value,
        default
    FROM system.server_settings
    WHERE name IN (
        'max_thread_pool_size',
        'max_thread_pool_free_size',
        'thread_pool_queue_size',
        'concurrent_threads_soft_limit_num',
        'concurrent_threads_soft_limit_ratio_to_cores'
    );
    

Special considerations

For ClickHouse Cloud:

  • Thread pool sized based on instance tier
  • Cannot directly configure max_thread_pool_size
  • Errors may indicate need to scale up instance
  • Temporary spikes should be tolerated with retry logic

Thread pool types:

  • Global thread pool: General query execution threads
  • Background pool: Merges and mutations
  • IO pool: Disk and network I/O operations
  • Schedule pool: Background scheduled tasks

Concurrency control:

Thread vs query limits:

  • max_concurrent_queries limits number of queries
  • max_threads limits threads per query
  • Total threads = queries × threads_per_query
  • Thread pool must accommodate total demand

Thread-related settings

Server-level (config.xml):

<clickhouse>
    <!-- Maximum threads in global pool -->
    <max_thread_pool_size>10000</max_thread_pool_size>
    
    <!-- Idle threads kept in pool -->
    <max_thread_pool_free_size>1000</max_thread_pool_free_size>
    
    <!-- Max tasks waiting for threads -->
    <thread_pool_queue_size>10000</thread_pool_queue_size>
    
    <!-- Soft limit based on cores (if > 0) -->
    <concurrent_threads_soft_limit_ratio_to_cores>2</concurrent_threads_soft_limit_ratio_to_cores>
</clickhouse>

Query-level:

-- Threads for reading/processing
SET max_threads = 8;

-- Threads for parallel inserts
SET max_insert_threads = 4;

-- Threads for distributed queries
SET max_distributed_connections = 1024;

-- Background operations
SET background_pool_size = 16;
SET background_merges_mutations_concurrency_ratio = 2;

System limit configuration

Linux ulimits:

# Temporary increase
ulimit -u 65535

# Permanent configuration in /etc/security/limits.conf
clickhouse soft nproc 65535
clickhouse hard nproc 65535

# Or for all users
* soft nproc 65535
* hard nproc 65535

Kernel parameters:

# Increase thread limits
sysctl -w kernel.threads-max=200000
sysctl -w kernel.pid_max=200000

# Make permanent in /etc/sysctl.conf
kernel.threads-max = 200000
kernel.pid_max = 200000

Container limits (Kubernetes):

# Pod spec - adjust if needed
spec:
  containers:
  - name: clickhouse
    resources:
      limits:
        # Memory affects thread creation
        memory: 32Gi

Error message variations

"no free thread":

  • Thread pool at capacity
  • All threads busy with tasks
  • More common, usually temporary

"failed to start the thread":

  • System failed to create new thread
  • OS or resource limit reached
  • More serious, indicates system issue

"cannot allocate thread":

  • Memory allocation failed for thread
  • System resource exhaustion
  • May indicate memory pressure

Monitoring thread health

-- Real-time thread usage
SELECT 
    metric,
    value,
    description
FROM system.metrics
WHERE metric IN (
    'GlobalThread',
    'GlobalThreadActive',
    'LocalThread',
    'LocalThreadActive',
    'QueryPipelineExecutorThreads',
    'QueryPipelineExecutorThreadsActive'
);

-- Thread usage over time
SELECT 
    toStartOfMinute(event_time) AS minute,
    max(CurrentMetric_GlobalThread) AS max_threads,
    max(CurrentMetric_GlobalThreadActive) AS max_active,
    max(CurrentMetric_Query) AS max_queries
FROM system.metric_log
WHERE event_time >= now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute DESC;

-- Queries that failed due to thread exhaustion
SELECT 
    toStartOfMinute(event_time) AS minute,
    count() AS error_count,
    count(DISTINCT user) AS affected_users
FROM system.query_log
WHERE exception_code = 439
    AND event_date >= today() - 7
GROUP BY minute
HAVING error_count > 0
ORDER BY minute DESC;

Recovery and mitigation

Immediate actions:

  1. Wait and retry - Thread pool may free up quickly

  2. Kill long-running queries - Free up threads

    -- Find long-running queries
    SELECT query_id, user, elapsed, query
    FROM system.processes
    WHERE elapsed > 300
    ORDER BY elapsed DESC;
    
    -- Kill if appropriate
    KILL QUERY WHERE query_id = 'long_running_query';
    
  3. Reduce query load - Temporarily throttle queries on client side

  4. Restart ClickHouse - Clears thread pool (last resort)

Long-term fixes:

  1. Optimize query thread usage:

    -- Set sensible defaults
    ALTER USER default SETTINGS max_threads = 8;
    ALTER USER default SETTINGS max_insert_threads = 4;
    
  2. Increase thread pool size (if system can handle it):

    <max_thread_pool_size>20000</max_thread_pool_size>
    
  3. Configure concurrency control:

    <!-- Limit threads based on CPU cores -->
    <concurrent_threads_soft_limit_ratio_to_cores>2</concurrent_threads_soft_limit_ratio_to_cores>
    
  4. Increase system limits:

    # Increase user process limit
    ulimit -u 100000
    
    # Increase kernel limits
    sysctl -w kernel.threads-max=200000
    

Prevention tips

  1. Set appropriate max_threads: Don't use default if you have high concurrency
  2. Monitor thread metrics: Track thread pool usage trends
  3. Configure system limits properly: Ensure OS limits match workload
  4. Use async inserts: Reduce thread consumption for insert operations
  5. Implement rate limiting: Control concurrent query load
  6. Test under load: Verify thread pool sizing for peak loads
  7. Keep ClickHouse updated: Concurrency control improvements in newer versions

Known issues and fixes

Issue: Concurrency control broken before October 2024

Issue: High insert threads with concurrent inserts

Issue: Query pipeline executor threads

  • Symptom: QueryPipelineExecutorThreadsActive reaching pool limit
  • Context: Modern query execution uses pipeline executor threads
  • Solution: Proper concurrency control (fixed in 24.10+)

Diagnosing thread pool exhaustion

-- Snapshot of thread usage at error time
WITH error_times AS (
    SELECT DISTINCT toStartOfMinute(event_time) AS error_minute
    FROM system.query_log
    WHERE exception_code = 439
        AND event_time >= now() - INTERVAL 6 HOUR
)
SELECT 
    m.event_time,
    m.CurrentMetric_GlobalThread AS total_threads,
    m.CurrentMetric_GlobalThreadActive AS active_threads,
    m.CurrentMetric_Query AS concurrent_queries,
    m.CurrentMetric_QueryPipelineExecutorThreads AS executor_threads
FROM system.metric_log m
INNER JOIN error_times e ON toStartOfMinute(m.event_time) = e.error_minute
ORDER BY m.event_time;

-- What was running when error occurred
SELECT 
    user,
    count() AS query_count,
    sum(ProfileEvents['QueryPipelineExecutorThreads']) AS total_threads_requested
FROM system.processes
WHERE query_start_time >= 'time_of_error' - INTERVAL 1 MINUTE
    AND query_start_time <= 'time_of_error' + INTERVAL 1 MINUTE
GROUP BY user
ORDER BY total_threads_requested DESC;

For high-concurrency workloads:

-- Per-query thread limits
SET max_threads = 4;  -- Instead of default (CPU cores)
SET max_insert_threads = 4;

-- Enable concurrency control
-- (server config for 24.10+)
concurrent_threads_soft_limit_ratio_to_cores = 2

For analytical workloads:

-- Can use more threads per query
SET max_threads = 16;

-- But limit concurrent queries
SET max_concurrent_queries_for_user = 5;

For mixed workloads:

-- Balance between parallelism and concurrency
SET max_threads = 8;
SET max_insert_threads = 4;
SET max_concurrent_queries_for_user = 20;

When to increase max_thread_pool_size

Consider increasing if:

  • Consistently hitting thread pool limit
  • High concurrency is expected workload pattern
  • System has sufficient resources (CPU, memory)
  • Errors correlate with legitimate traffic spikes

Don't increase if:

  • System already at resource limits
  • Better to reduce per-query thread usage
  • Horizontal scaling is an option
  • Queries can be optimized to use fewer threads

Thread pool sizing guidelines

Recommended max_thread_pool_size calculation:
= (concurrent_queries × max_threads_per_query) × 1.5 safety margin

Example:
- Expected concurrent queries: 50
- Average max_threads: 8
- Calculation: 50 × 8 × 1.5 = 600 threads

But also consider:
- System CPU cores (more threads than cores causes context switching)
- Available memory (each thread has stack, typically 8-10 MB)
- Background operations (merges, mutations need threads too)

Temporary workarounds

While waiting for long-term fixes:

-- Reduce thread usage across all queries
ALTER SETTINGS PROFILE default SET max_threads = 4;

-- Prioritize critical queries
SELECT * FROM important_table
SETTINGS priority = 1;  -- Higher priority

-- For non-critical queries
SELECT * FROM less_important_table
SETTINGS priority = 10,  -- Lower priority
         max_threads = 2;  -- Fewer threads

For ClickHouse Cloud users

Limitations:

  • Cannot directly configure max_thread_pool_size
  • Thread pool sized by instance tier
  • Need to upgrade tier if consistently hitting limits

Recommendations:

  • Set appropriate max_threads and max_insert_threads
  • Monitor thread usage metrics
  • Scale up tier if thread exhaustion is frequent
  • Implement retry logic for transient errors
  • Consider horizontal scaling (more replicas)

Escalation:

  • If errors persist after optimization
  • If thread pool appears undersized for tier
  • Contact support with thread usage metrics

If you're experiencing this error:

  1. Check if this is a transient spike (retry may succeed)
  2. Review current thread pool usage in system.metrics
  3. Check for traffic spike or abnormal query patterns
  4. Verify system thread limits are adequate
  5. Reduce max_threads and max_insert_threads if set too high
  6. Monitor for queries using excessive threads
  7. For persistent issues, increase max_thread_pool_size (self-managed) or scale up (Cloud)
  8. Ensure concurrency control is working (upgrade to 24.10+ if needed)
  9. Implement client-side retry with exponential backoff

Related documentation: