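There is a large database with 1,000,000,000 rows in a table called threads (these threads really exist; I'm not making things harder just because I like it). The table has only a few columns, to keep things fast: (int id, string hash, int replycount, int dateline (timestamp), int forumid, string title)

Query: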

select * from thread where forumid = 100 and replycount > 1 order by dateline desc limit 10000, 100

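Since there are 1G records, this query is slow. So I thought: let's split these 1G records into as many tables as I have forums (categories)! That is almost perfect. With many tables, I have fewer records to search through and it really is faster. The query now becomes: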

select * from thread_{forum_id} where replycount > 1 order by dateline desc limit 10000, 100

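This really is faster for 99% of the forums (categories), since most of them have only a few topics (100k-1M). However, because some forums have about 10M records, some queries are still too slow (0.1/0.2 seconds, which is too long for my app, and I'm already using indexes!).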

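I don't know how to improve this using MySQL. Is there a way?

For this project I will use 10 servers (12GB RAM, 4x 7200rpm hard disks on software RAID 10, quad core).

The idea was simply to split the databases among the servers, but with the problems explained above that is still not enough.

If I install Cassandra on these 10 servers (supposing I find the time to make it work as it is supposed to), should I expect a performance boost?

What should I do? Keep working with MySQL with the database distributed across multiple machines, or build a Cassandra cluster?

I was asked to post the indexes; here they are: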

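mysql> show index in thread;
PRIMARY id forumid dateline replycount

Select explained:

You should read the following and learn a little about the advantages of a well-designed innodb table and how best to use clustered indexes, which are only available with innodb!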

http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html

http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/

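Then design your system along the lines of the following simplified example:

Example schema (simplified)

The important features are that the tables use the innodb engine and that the primary key for the threads table is no longer a single auto_incrementing key but a composite clustered key based on a combination of forum_id and thread_id, e.g.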

threads - primary key (forum_id, thread_id)

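forum_id thread_id
======== =========
1        1
1        2
1        3
1        ...
1        2058300
2        1
2        2
2        3
2        ...
2        2352141
...

Each forum row includes a counter called next_thread_id (unsigned int), which is maintained by a trigger and incremented every time a thread is added to a given forum. This also means we can store 4 billion threads per forum, rather than 4 billion threads in total as would be the case with a single auto_increment primary key on thread_id.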

forum_id title     next_thread_id
======== =====     ==============
1        forum 1   2058300
2        forum 2   2352141
3        forum 3   2482805
4        forum 4   3740957
...
64       forum 64  3243097
65       forum 65  15000000 -- ooh a big one
66       forum 66  5038900
67       forum 67  4449764
...
247      forum 247 0 -- still loading data for half the forums !
248      forum 248 0
249      forum 249 0
250      forum 250 0

The disadvantage of using a composite key is that you can no longer select a thread by a single key value as follows:

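select * from threads where thread_id = y;

Instead, you have to do:

select * from threads where forum_id = x and thread_id = y;

However, your application code should know which forum a user is browsing, so this is not difficult to implement: store the currently viewed forum_id in a session variable, a hidden form field, etc.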

Here's the simplified schema:

drop table if exists forums;
create table forums
(
    forum_id smallint unsigned not null auto_increment primary key,
    title varchar(255) unique not null,
    next_thread_id int unsigned not null default 0 -- count of threads in each forum
)
engine=innodb;

drop table if exists threads;
create table threads
(
    forum_id smallint unsigned not null,
    thread_id int unsigned not null default 0,
    reply_count int unsigned not null default 0,
    hash char(32) not null,
    created_date datetime not null,
    primary key (forum_id, thread_id, reply_count) -- composite clustered index
)
engine=innodb;

delimiter #

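create trigger threads_before_ins_trig before insert on threads
for each row
begin
    declare v_id int unsigned default 0;

    -- read the forum's counter and reserve the next thread_id
    select next_thread_id + 1 into v_id from forums where forum_id = new.forum_id;
    set new.thread_id = v_id;
    -- store the new counter value back on the forum row
    update forums set next_thread_id = v_id where forum_id = new.forum_id;
end#

delimiter ;

You may have noticed that I've included reply_count as part of the primary key, which is a bit strange since the (forum_id, thread_id) composite is unique in itself. This is just an index optimisation that saves some I/O when queries that use reply_count are executed. Please refer to the 2 links above for further info on this.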
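As a quick illustration of how the trigger behaves, the sketch below inserts a couple of forums and threads and reads them back. The forum titles and hash inputs are made-up sample data, and it assumes the forums and threads tables and the trigger above have just been created on otherwise empty tables.

```sql
-- sample data (hypothetical values, not from the schema above)
insert into forums (title) values ('forum 1'), ('forum 2');

-- thread_id is left out on purpose: the before-insert trigger assigns it per forum
insert into threads (forum_id, hash, created_date) values
    (1, md5('thread a'), now()),
    (1, md5('thread b'), now()),
    (2, md5('thread c'), now());

-- each forum numbers its own threads, so forum 1 and forum 2 both start at thread_id 1
select forum_id, thread_id from threads order by forum_id, thread_id;

-- the per-forum counters should now equal the number of threads inserted into each forum
select forum_id, next_thread_id from forums order by forum_id;
```

Under these assumptions, forum 1 should end up with thread_ids 1 and 2 and forum 2 with thread_id 1, with next_thread_id holding 2 and 1 respectively.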

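Example queries

I'm still loading data into the example tables and so far I have loaded approx. 500 million rows (half as many as your system). When the load process is complete I should expect to have approximately:

250 forums * 5 million threads = 1250 000 000 (roughly 1.2 billion rows)

I've deliberately made some of the forums contain more than 5 million threads; for example, forum 65 has 15 million threads:

forum_id title    next_thread_id
======== =====    ==============
65       forum 65 15000000 -- ooh a big one

Query runtimes

select sum(next_thread_id) from forums;

sum(next_thread_id)
===================
539,155,433 (500 million threads so far and still growing...)

Under innodb, summing the next_thread_ids to give a total thread count is much faster than the usual:

select count(*) from threads;

How many threads does forum 65 have:

select next_thread_id from forums where forum_id = 65

next_thread_id
==============
15,000,000 (15 million)

Again, this is faster than the usual:

select count(*) from threads where forum_id = 65

OK, now we know that we have about 500 million threads so far and that forum 65 has 15 million threads, so let's see how the schema performs :)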

select forum_id, thread_id from threads where forum_id = 65 and reply_count > 64 order by thread_id desc limit 32;

runtime = 0.022 secs

select forum_id, thread_id from threads where forum_id = 65 and reply_count > 1 order by thread_id desc limit 10000, 100;

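runtime = 0.027 secs

That looks pretty performant to me: a single table with 500+ million rows (and growing), and a query that covers 15 million rows in about 0.02 seconds (while under load!).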
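To check how MySQL resolves the paginated query, one could prefix it with EXPLAIN. This is only a sketch: the answer does not show a query plan, so no output is reproduced here. With the composite clustered key one would expect the key column of the plan to report PRIMARY, but that is an assumption to verify on your own data.

```sql
-- ask the optimizer how it plans to run the paginated query on forum 65
explain
select forum_id, thread_id
from threads
where forum_id = 65 and reply_count > 1
order by thread_id desc
limit 10000, 100;
```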

Further optimisations

These would include:

partitioning by range

sharding

throwing money and hardware at it

etc...
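As a rough sketch of the first option (partitioning by range), the threads table could be range-partitioned on forum_id. MySQL allows this here because forum_id is part of the primary key. The partition names and boundary values are illustrative assumptions rather than part of the answer, and partitioning needs MySQL 5.1 or later.

```sql
-- split threads into four partitions by forum_id (boundaries are hypothetical)
alter table threads
partition by range (forum_id) (
    partition p0 values less than (64),   -- forums 0..63
    partition p1 values less than (128),  -- forums 64..127
    partition p2 values less than (192),  -- forums 128..191
    partition p3 values less than maxvalue -- everything else
);
```

Queries that filter on forum_id, like the ones above, can then be restricted to a single partition by the optimizer; whether that helps in practice would need measuring.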

Hope you find this answer helpful :)
