An algorithm for guessing video file names from subtitle file names, as used on a certain website ; 論野生技術&二次元

I thought this was kind of interesting, so I'm putting it out here for everyone to poke at together. Better ideas are welcome.

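By the time I was crawling x手, its web pages were already gone, so I was on my own. I used unrar (the one downloaded from rarlab; the version in apt-get is ancient) to get the file names inside rar archives, and the zipfile module to list the contents of zip archives. While crawling I just dumped whatever came back straight into the database without analysing it. There was no time to think about algorithms, so grabbing the dirty data came first.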
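As a rough illustration of the zip half of that step, here is a minimal sketch that lists the member names of a zip archive held in memory. The helper name `list_zip_names` and the sample file names are made up for this example; the post does not show the actual crawler code.

```python
import io
import zipfile

def list_zip_names(data):
    """Return the member file names of a zip archive given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        # Directory entries end with '/', so leave them out.
        return [n for n in zf.namelist() if not n.endswith('/')]

# Build a tiny archive in memory so the example needs no files on disk.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('Show.S01E01.eng.srt', 'dummy')
    zf.writestr('Show.S01E01.chs.srt', 'dummy')

names = list_zip_names(buf.getvalue())
```

Only the names are read; nothing is extracted, which matches the "grab first, analyse later" approach described above.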

Now for the main part www

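The basic idea: apart from the extension, a subtitle file name is identical to the video file name. If the archive contains only one file, just strip its extension and you're done. But if it holds several versions of the subtitle (eng, GB, BIG5 and so on), we need a "maximal string matching" algorithm. (← a name I made up to sound fancy)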

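Here's my thinking. First we need a smallest unit to compare on. Matching character by character, on top of the time complexity of going through all the combinations, would probably bring the whole thing to its knees, so the number of units has to come down. Most file names are separated by spaces, "-", "_", or "][" (anime releases tend to love square brackets), so pick the separator that cuts the file names into the most pieces: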

splt = '.-_ ]'
m_splt = max(splt, key = lambda x:sum(map(lambda l:len(l.split(x)), lst)))

The more pieces we get, the finer the matching granularity, naturally.
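To make the separator choice concrete, here is a small worked example on two made-up bracket-style names. The helper `total_pieces` is just an unrolled version of the lambda in the snippet above.

```python
names = [
    '[Group][Show][01][BIG5]',
    '[Group][Show][01][GB]',
]
candidates = '.-_ ]'

def total_pieces(sep):
    # Total number of pieces over all names when splitting on sep.
    return sum(len(name.split(sep)) for name in names)

counts = {sep: total_pieces(sep) for sep in candidates}
best = max(candidates, key=total_pieces)
```

Here `]` wins with 10 pieces against 2 for every other candidate. Each name ends with `]`, so its last piece is an empty string, which is why the full function puts a closing `]` back after joining. On a tie, `max` keeps the first candidate in the string, so `.` has priority.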

After splitting, every file name in the archive has become a list of units.
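The comparison then runs position by position across those lists. A minimal sketch of that alignment, using `itertools.zip_longest` to pad shorter names with `None` (the sample names are invented):

```python
from itertools import zip_longest

names = ['Show.S01E01.eng', 'Show.S01E01.chs', 'Show.S01E01']
split_names = [name.split('.') for name in names]

# zip_longest pads the shorter lists with None, so every column
# has exactly one entry per file.
columns = list(zip_longest(*split_names))
```

Each tuple in `columns` holds the units that sit at the same position in every file name; the third column here is `('eng', 'chs', None)`.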

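Then the units at each position are compared pairwise across the files. If enough pairs (more than a threshold) agree on a unit, that unit is taken to be part of the common name: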

def getEqual(l):
    # l holds the units found at one position, one per file
    # (None where a file name ran out of units)
    cnt = len(l)
    equals = {}
    for i, j in itertools.combinations(l, 2):
        if not i or not j:
            continue
        if i == j:
            pass
        elif i.upper() == j.upper():
            # same unit apart from letter case: count it under the upper-cased form
            i = i.upper()
        elif getCapital(i) == j or getCapital(j) == i:
            i = getCapital(i)
        else:
            continue
        # reaching here means i and j are considered equal
        equals[i] = equals.get(i, 0) + 1
    if not equals:
        return False, ''
    # the unit that the most pairs agree on
    m = max(equals.items(), key=lambda x: x[1])
    _comb = cnt * (cnt - 1) // 2  # total number of pairs
    if m[1] > 0.3 * _comb or m[1] == _comb:
        return True, m[0]
    else:
        return False, ''

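getCapital is a function that upper-cases the first letter. It's there because some jerks capitalize the first letter one moment and not the next. I don't simply lower-case everything before comparing, because I want to keep the file name as close to the original as possible. Some names really do start with a lower-case letter, and lower-casing everything would bite us there.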
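The post does not show the body of getCapital, so the following is only a guess at what it might look like, based on the description above: upper-case the first character and leave the rest alone.

```python
def getCapital(s):
    # Assumed implementation: upper-case only the first character;
    # the rest of the string is left exactly as it was.
    return s[:1].upper() + s[1:]

a = getCapital('show')      # first letter raised, tail untouched
b = getCapital('sHOW')      # tail keeps its original case
c = 'sHOW'.capitalize()     # built-in: also lower-cases the tail
```

The built-in `str.capitalize` is not a drop-in replacement, because it lower-cases everything after the first character. One thing worth noting: any pair that matches through getCapital also matches once both sides are upper-cased, so in getEqual as written the `upper()` test catches such pairs first and records them in fully upper-cased form.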

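The threshold here: if 30% of the pairs match, the unit counts as common. Think that's low? It really isn't. Imagine some goofball fansub group drops a 招人.srt (a recruiting notice) into the archive; a strict rule would fall flat on its face.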
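To see what 30% of the pairs means in terms of files, here is a quick bit of arithmetic. It assumes the agreeing files carry exactly the same unit, so k agreeing files out of n give C(k, 2) matching pairs out of C(n, 2). The helper name `passes` is made up for this sketch.

```python
from math import comb

def passes(n_files, n_agree, threshold=0.3):
    # Matching pairs among the agreeing files vs. all pairs of files.
    return comb(n_agree, 2) > threshold * comb(n_files, 2)

n = 10
# Smallest number of agreeing files that clears the threshold.
min_agree = min(k for k in range(n + 1) if passes(n, k))
```

With 10 files there are 45 pairs, so more than 13.5 must match. Five agreeing files give only 10 pairs, six give 15, so at least 6 of the 10 files have to agree. In other words, 30% of the pairs works out to a bit over half of the files, which leaves room for a few stray files without letting a minority win.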

Extensions have to be filtered out as well, of course.
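A minimal sketch of that filtering step, using `os.path.splitext`. The skipped extensions are the ones that appear in the post's code; the helper name `strip_names`, the sample names and the case-insensitive comparison are assumptions of this example (the post's code compares the last four characters case-sensitively).

```python
import os

SKIP_EXTS = ('.txt', '.jpg', '.gif')  # non-subtitle files to ignore

def strip_names(names):
    stems = []
    for name in names:
        stem, ext = os.path.splitext(name)
        # Keep only files whose extension is not in the skip list.
        if name and ext.lower() not in SKIP_EXTS:
            stems.append(stem)
    return stems

stems = strip_names(['Show.01.eng.srt', 'Show.01.chs.ass',
                     'readme.txt', 'cover.JPG'])
```

Only the last extension is removed, so language tags such as `eng` or `chs` stay in the stem and are left for the unit comparison to sort out.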

Argh, this is getting tedious. I'm done writing, just read the code.

import itertools
import re

def getCommon(ori_lst, splt='.-_ ]', with_no_digit=False):
    if with_no_digit:
        # second pass: drop all digits; extensions were already stripped in the first pass
        lst = [re.sub(r'\d+', '', x) for x in ori_lst]
    else:
        # first pass: skip non-subtitle files and strip the extension
        lst = ['.'.join(x.split('.')[:-1]) for x in ori_lst
               if x and x[-4:] not in ('.txt', '.jpg', '.gif')]
    if len(lst) == 1:
        return lst[0]
    # pick the separator that yields the most pieces overall
    m_splt = max(splt, key=lambda x: sum(len(l.split(x)) for l in lst))
    def getEqual(l):
        # l holds the units found at one position, one per file
        cnt = len(l)
        equals = {}
        for i, j in itertools.combinations(l, 2):
            if not i or not j:
                continue
            if i == j:
                pass
            elif i.upper() == j.upper():
                i = i.upper()
            elif getCapital(i) == j or getCapital(j) == i:
                i = getCapital(i)
            else:
                continue
            # reaching here means i and j are considered equal
            equals[i] = equals.get(i, 0) + 1
        if not equals:
            return False, ''
        m = max(equals.items(), key=lambda x: x[1])
        _comb = cnt * (cnt - 1) // 2  # total number of pairs
        if m[1] > 0.3 * _comb or m[1] == _comb:
            return True, m[0]
        else:
            return False, ''
    m_lst = [l.split(m_splt) for l in lst]
    if not m_lst:
        return ''
    m_pattern = []
    # zip_longest pads the shorter lists with None
    for p in itertools.zip_longest(*m_lst):
        suc, new_pattern = getEqual(p)
        if suc:
            m_pattern.append(new_pattern)
        else:
            break
    ret = m_splt.join(m_pattern)
    if ret and m_splt == ']':
        # splitting on ']' ate the closing bracket of the last unit; put it back
        ret += ']'
    if not ret and not with_no_digit:
        # retry without digits to get rid of "season" and "episode" differences;
        # pass the prepared lst (extensions already stripped) instead of ori_lst
        return getCommon(lst, splt, with_no_digit=True)
    else:
        return ret
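The fallback at the end of getCommon retries with every run of digits removed when the first pass finds nothing in common. A tiny sketch of just that substitution, on invented names that differ only in their season and episode numbers:

```python
import re

names = ['Show.S01E01', 'Show.S01E02', 'Show.S02E01']

# Remove every run of digits, as the second pass does.
no_digits = [re.sub(r'\d+', '', name) for name in names]
```

All three names collapse to `Show.SE`, so units that differed only by number now compare equal. The price is that the digits are gone from the returned name as well.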
