Самый быстрый код Python, чтобы найти набор выигрышных слов в этой игре

Это игра в слова из набора игровых карточек для детей. Ниже правил приведен код для поиска лучшего триплета с использованием / usr / share / dict / words. Я подумал, что это интересная проблема оптимизации, и мне интересно, могут ли люди найти улучшения.

правила

Выберите одну букву из каждого набора ниже.
Выберите слово, используя выбранные буквы (и любые другие).
Оценка слова.
- Каждая буква из выбранного набора получает номер, показанный вместе с набором (повторы включены).
- AEIOU считать 0
- Все остальные буквы -2
Повторите шаги 1-3 выше (без повторного использования букв на шаге 1) еще дважды.
Окончательный результат - это сумма трех слов.

наборы

(установить 1 балл 1 балл, установить 2 балла 2 балла и т. д.)

Код:

from itertools import permutations
import numpy as np

points = {'LTN' : 1,
          'RDS' : 2,
          'GBM' : 3,
          'CHP' : 4,
          'FWV' : 5,
          'YKJ' : 6,
          'QXZ' : 7}

def tonum(word):
    word_array = np.zeros(26, dtype=np.int)
    for l in word:
        word_array[ord(l) - ord('A')] += 1
    return word_array.reshape((26, 1))

def to_score_array(letters):
    score_array = np.zeros(26, dtype=np.int) - 2
    for v in 'AEIOU':
        score_array[ord(v) - ord('A')] = 0
    for idx, l in enumerate(letters):
        score_array[ord(l) - ord('A')] = idx + 1
    return np.matrix(score_array.reshape(1, 26))

def find_best_words():
    wlist = [l.strip().upper() for l in open('/usr/share/dict/words') if l[0].lower() == l[0]]
    wlist = [l for l in wlist if len(l) > 4]
    orig = [l for l in wlist]
    for rep in 'AEIOU':
        wlist = [l.replace(rep, '') for l in wlist]
    wlist = np.hstack([tonum(w) for w in wlist])

    best = 0
    ct = 0
    bestwords = ()
    for c1 in ['LTN']:
        for c2 in permutations('RDS'):
            for c3 in permutations('GBM'):
                for c4 in permutations('CHP'):
                    for c5 in permutations('FWV'):
                        for c6 in permutations('YJK'):
                            for c7 in permutations('QZX'):
                                vals = [to_score_array(''.join(s)) for s in zip(c1, c2, c3, c4, c5, c6, c7)]
                                ct += 1
                                print ct, 6**6
                                scores1 = (vals[0] * wlist).A.flatten()
                                scores2 = (vals[1] * wlist).A.flatten()
                                scores3 = (vals[2] * wlist).A.flatten()
                                m1 = max(scores1)
                                m2 = max(scores2)
                                m3 = max(scores3)
                                if m1 + m2 + m3 > best:
                                    print orig[scores1.argmax()], orig[scores2.argmax()], orig[scores3.argmax()], m1 + m2 + m3
                                    best = m1 + m2 + m3
                                    bestwords = (orig[scores1.argmax()], orig[scores2.argmax()], orig[scores3.argmax()])
    return bestwords, best


if __name__ == '__main__':
    import timeit
    print timeit.timeit('print find_best_words()', 'from __main__ import find_best_words', number=1)

Матричная версия - это то, что я придумал после написания одной на чистом Python (используя словари и оценивая каждое слово независимо), а другой - на numpy, но используя индексирование, а не матричное умножение.

Следующая оптимизация будет состоять в том, чтобы полностью удалить гласные из оценки (и использовать модифицированную ord()функцию), но мне интересно, есть ли еще более быстрые подходы.

РЕДАКТИРОВАТЬ : добавлен код timeit.timeit

РЕДАКТИРОВАТЬ : я добавляю вознаграждение, которое я дам для любого улучшения, которое мне больше всего нравится (или, возможно, несколько ответов, но мне придется набрать еще немного репутации, если это так).

fastest-code python optimization thouis
источник

Кстати, я написал код, чтобы дать моим восьмилетним детям три слова для запоминания, когда он играл в игру против своей матери. Теперь я знаю, что значит ксилопирография.

Это забавный вопрос. Я думаю, вы могли бы повысить вероятность получения ответов, если предоставите следующее: (1) Ссылка на онлайн-список слов, чтобы все работали с одним и тем же набором данных. (2) Поместите свое решение в одну функцию. (3) Запустите эту функцию, используя модуль time-it, чтобы показать время. (4) Убедитесь, что загрузка словарных данных находится за пределами функции, чтобы мы не проверяли скорость диска. Затем люди могут использовать существующий код в качестве основы для сравнения своих решений.

Я перепишу, чтобы использовать timeit, но для честных сравнений мне пришлось бы использовать свой собственный компьютер (что я с удовольствием сделаю для людей, публикующих решения). Список слов должен быть доступен на большинстве систем, но если нет, есть несколько здесь: wordlist.sourceforge.net

Справедливые сравнения можно провести, если каждый пользователь разместит ваше решение и любые другие опубликованные решения на своем компьютере. Будут некоторые отличия кроссплатформенности, но в целом этот метод работает.

Хм, в таком случае мне интересно, правильный ли это сайт. Я думаю, что так было бы лучше всего подходит.

Джои

Ответы:

Используя идею Кейта о предварительном вычислении наилучшего результата для каждого слова, мне удалось сократить время выполнения на моем компьютере примерно до 0,7 секунды (используя список из 75 288 слов).

Хитрость заключается в том, чтобы проходить комбинации слов, которые нужно сыграть, вместо всех выбранных комбинаций букв. Мы можем игнорировать все, кроме нескольких комбинаций слов (203, используя мой список слов), потому что они не могут получить более высокий балл, чем мы уже нашли. Почти все время выполнения тратится на предварительные вычисления слов.

Python 2.7:

import collections
import itertools


WORDS_SOURCE = '../word lists/wordsinf.txt'

WORDS_PER_ROUND = 3
LETTER_GROUP_STRS = ['LTN', 'RDS', 'GBM', 'CHP', 'FWV', 'YKJ', 'QXZ']
LETTER_GROUPS = [list(group) for group in LETTER_GROUP_STRS]
GROUP_POINTS = [(group, i+1) for i, group in enumerate(LETTER_GROUPS)]
POINTS_IF_NOT_CHOSEN = -2


def best_word_score(word):
    """Return the best possible score for a given word."""

    word_score = 0

    # Score the letters that are in groups, chosing the best letter for each
    # group of letters.
    total_not_chosen = 0
    for group, points_if_chosen in GROUP_POINTS:
        letter_counts_sum = 0
        max_letter_count = 0
        for letter in group:
            if letter in word:
                count = word.count(letter)
                letter_counts_sum += count
                if count > max_letter_count:
                    max_letter_count = count
        if letter_counts_sum:
            word_score += points_if_chosen * max_letter_count
            total_not_chosen += letter_counts_sum - max_letter_count
    word_score += POINTS_IF_NOT_CHOSEN * total_not_chosen

    return word_score

def best_total_score(words):
    """Return the best score possible for a given list of words.

    It is fine if the number of words provided is not WORDS_PER_ROUND. Only the
    words provided are scored."""

    num_words = len(words)
    total_score = 0

    # Score the letters that are in groups, chosing the best permutation of
    # letters for each group of letters.
    total_not_chosen = 0
    for group, points_if_chosen in GROUP_POINTS:
        letter_counts = []
        # Structure:  letter_counts[word_index][letter] = count
        letter_counts_sum = 0
        for word in words:
            this_word_letter_counts = {}
            for letter in group:
                count = word.count(letter)
                this_word_letter_counts[letter] = count
                letter_counts_sum += count
            letter_counts.append(this_word_letter_counts)

        max_chosen = None
        for letters in itertools.permutations(group, num_words):
            num_chosen = 0
            for word_index, letter in enumerate(letters):
                num_chosen += letter_counts[word_index][letter]
            if num_chosen > max_chosen:
                max_chosen = num_chosen

        total_score += points_if_chosen * max_chosen
        total_not_chosen += letter_counts_sum - max_chosen
    total_score += POINTS_IF_NOT_CHOSEN * total_not_chosen

    return total_score


def get_words():
    """Return the list of valid words."""
    with open(WORDS_SOURCE, 'r') as source:
        return [line.rstrip().upper() for line in source]

def get_words_by_score():
    """Return a dictionary mapping each score to a list of words.

    The key is the best possible score for each word in the corresponding
    list."""

    words = get_words()
    words_by_score = collections.defaultdict(list)
    for word in words:
        words_by_score[best_word_score(word)].append(word)
    return words_by_score


def get_winning_words():
    """Return a list of words for an optimal play."""

    # A word's position is a tuple of its score's index and the index of the
    # word within the list of words with this score.
    # 
    # word played: A word in the context of a combination of words to be played
    # word chosen: A word in the context of the list it was picked from

    words_by_score = get_words_by_score()
    num_word_scores = len(words_by_score)
    word_scores = sorted(words_by_score, reverse=True)
    words_by_position = []
    # Structure:  words_by_position[score_index][word_index] = word
    num_words_for_scores = []
    for score in word_scores:
        words = words_by_score[score]
        words_by_position.append(words)
        num_words_for_scores.append(len(words))

    # Go through the combinations of words in lexicographic order by word
    # position to find the best combination.
    best_score = None
    positions = [(0, 0)] * WORDS_PER_ROUND
    words = [words_by_position[0][0]] * WORDS_PER_ROUND
    scores_before_words = []
    for i in xrange(WORDS_PER_ROUND):
        scores_before_words.append(best_total_score(words[:i]))
    while True:
        # Keep track of the best possible combination of words so far.
        score = best_total_score(words)
        if score > best_score:
            best_score = score
            best_words = words[:]

        # Go to the next combination of words that could get a new best score.
        for word_played_index in reversed(xrange(WORDS_PER_ROUND)):
            # Go to the next valid word position.
            score_index, word_chosen_index = positions[word_played_index]
            word_chosen_index += 1
            if word_chosen_index == num_words_for_scores[score_index]:
                score_index += 1
                if score_index == num_word_scores:
                    continue
                word_chosen_index = 0

            # Check whether the new combination of words could possibly get a
            # new best score.
            num_words_changed = WORDS_PER_ROUND - word_played_index
            score_before_this_word = scores_before_words[word_played_index]
            further_points_limit = word_scores[score_index] * num_words_changed
            score_limit = score_before_this_word + further_points_limit
            if score_limit <= best_score:
                continue

            # Update to the new combination of words.
            position = score_index, word_chosen_index
            positions[word_played_index:] = [position] * num_words_changed
            word = words_by_position[score_index][word_chosen_index]
            words[word_played_index:] = [word] * num_words_changed
            for i in xrange(word_played_index+1, WORDS_PER_ROUND):
                scores_before_words[i] = best_total_score(words[:i])
            break
        else:
            # None of the remaining combinations of words can get a new best
            # score.
            break

    return best_words


def main():
    winning_words = get_winning_words()
    print winning_words
    print best_total_score(winning_words)

if __name__ == '__main__':
    main()

Это возвращает решение ['KNICKKNACK', 'RAZZMATAZZ', 'POLYSYLLABLES']со счетом 95. Со словами из решения Кейта, добавленными в список слов, я получаю тот же результат, что и он. С добавленной вами «ксилопирографией» я получил ['XYLOPYROGRAPHY', 'KNICKKNACKS', 'RAZZMATAZZ']105 баллов.

flornquake
источник

Вот идея - вы можете избежать проверки большого количества слов, заметив, что большинство слов имеют ужасные оценки. Скажем, вы нашли довольно хорошую игру, которая приносит вам 50 очков. Тогда любая игра с более чем 50 очками должна иметь слово не менее ceil (51/3) = 17 очков. Поэтому любое слово, которое не может дать 17 очков, можно игнорировать.

Вот код, который делает выше. Мы вычисляем наилучшую возможную оценку для каждого слова в словаре и сохраняем его в массиве, индексированном по оценке. Затем мы используем этот массив для проверки только тех слов, которые имеют требуемый минимальный балл.

from itertools import permutations
import time

S={'A':0,'E':0,'I':0,'O':0,'U':0,
   'L':1,'T':1,'N':1,
   'R':2,'D':2,'S':2,
   'G':3,'B':3,'M':3,
   'C':4,'H':4,'P':4,
   'F':5,'W':5,'V':5,
   'Y':6,'K':6,'J':6,
   'Q':7,'X':7,'Z':7,
   }

def best_word(min, s):
    global score_to_words
    best_score = 0
    best_word = ''
    for i in xrange(min, 100):
        for w in score_to_words[i]:
            score = (-2*len(w)+2*(w.count('A')+w.count('E')+w.count('I')+w.count('O')+w.count('U')) +
                      3*w.count(s[0])+4*w.count(s[1])+5*w.count(s[2])+6*w.count(s[3])+7*w.count(s[4])+
                      8*w.count(s[5])+9*w.count(s[6]))
            if score > best_score:
                best_score = score
                best_word = w
    return (best_score, best_word)

def load_words():
    global score_to_words
    wlist = [l.strip().upper() for l in open('/usr/share/dict/words') if l[0].lower() == l[0]]
    score_to_words = [[] for i in xrange(100)]
    for w in wlist: score_to_words[sum(S[c] for c in w)].append(w)
    for i in xrange(100):
        if score_to_words[i]: print i, len(score_to_words[i])

def find_best_words():
    load_words()
    best = 0
    bestwords = ()
    for c1 in permutations('LTN'):
        for c2 in permutations('RDS'):
            for c3 in permutations('GBM'):
            print time.ctime(),c1,c2,c3
                for c4 in permutations('CHP'):
                    for c5 in permutations('FWV'):
                        for c6 in permutations('YJK'):
                            for c7 in permutations('QZX'):
                                sets = zip(c1, c2, c3, c4, c5, c6, c7)
                                (s1, w1) = best_word((best + 3) / 3, sets[0])
                                (s2, w2) = best_word((best - s1 + 2) / 2, sets[1])
                                (s3, w3) = best_word(best - s1 - s2 + 1, sets[2])
                                score = s1 + s2 + s3
                                if score > best:
                                    best = score
                                    bestwords = (w1, w2, w3)
                                    print score, w1, w2, w3
    return bestwords, best


if __name__ == '__main__':
    import timeit
    print timeit.timeit('print find_best_words()', 'from __main__ import find_best_words', number=1)

Минимальный балл быстро увеличивается до 100, что означает, что нам нужно учитывать только 33+ балльных слова, что является очень малой долей от общего количества (у меня /usr/share/dict/words208662 правильных слов, только 1723 из которых 33+ баллов = 0,8%). Запускается примерно на полчаса на моей машине и генерирует:

(('MAXILLOPREMAXILLARY', 'KNICKKNACKED', 'ZIGZAGWISE'), 101)

Кит Рэндалл
источник

Ницца. Я добавлю это к матричному решению (удаляя слова, так как их оценка падает слишком низко), но это значительно лучше, чем любое из чисто Python-решений, которое я придумал.

thouis

Я не уверен, что когда-либо видел столько гнезд для циклов.

Питер Олсон

Объединение вашей идеи с оценкой матрицы (и более жестким верхним пределом для наилучшего возможного результата) сокращает время на моей машине примерно до 80 секунд (примерно с часа). код здесь

thouis

Хорошая часть этого времени находится в предварительном вычислении лучших возможных результатов, которые можно было бы сделать намного быстрее.

thouis